
Multimodal Behavior Analysis in the Wild: Advances and Challenges

Ebook, 941 pages (9 hours)


About this ebook

Multimodal Behavior Analysis in the Wild: Advances and Challenges presents the state-of-the-art in behavioral signal processing using different data modalities, with a special focus on identifying the strengths and limitations of current technologies. The book focuses on audio and video modalities, while also emphasizing emerging modalities, such as accelerometer or proximity data. It covers tasks at different levels of complexity, from low level (speaker detection, sensorimotor links, source separation), through middle level (conversational group detection, addresser and addressee identification), to high level (personality and emotion recognition), providing insights on how to exploit inter-level and intra-level links.

This is a valuable resource on the state-of-the-art and future research challenges of multimodal behavioral analysis in the wild. It is suitable for researchers and graduate students in the fields of computer vision, audio processing, pattern recognition, machine learning and social signal processing.

  • Gives a comprehensive collection of information on the state-of-the-art, limitations, and challenges associated with extracting behavioral cues from real-world scenarios
  • Presents numerous applications on how different behavioral cues have been successfully extracted from different data sources
  • Provides a wide variety of methodologies used to extract behavioral cues from multi-modal data
Language: English
Release date: Nov 13, 2018
ISBN: 9780128146026


    Book preview

    Multimodal Behavior Analysis in the Wild - Xavier Alameda-Pineda


    Multimodal behavior analysis in the wild: An introduction

    Xavier Alameda-Pineda⁎; Elisa Ricci†,‡; Nicu Sebe†

    ⁎Inria Grenoble Rhône-Alpes, Perception Team, France

    †University of Trento, Department of Information Engineering and Computer Science, Italy

    ‡Fondazione Bruno Kessler, Technology of Vision, Italy

    Abstract

    The nature of human behavior is complex and multifaceted. Behavioral expressions vary significantly across individuals and are influenced by many factors. People act differently according to their physical and mental state, to their age, gender and socio-cultural background, to the nature of the tasks they are engaged in, to the environment where they operate, to the behavior of other individuals, etc. All these factors make the automatic analysis of human behavior an extremely challenging problem.

    Despite its complexity, human behavior understanding has attracted considerable attention due to its many applications, e.g., in health care, conflict and people management, sociology, marketing and surveillance. In the last decades many researchers have invested effort into developing computational approaches that enable one to automatically describe the behavior of individuals and groups. Generally speaking, the extraction of behavioral information involves methods operating at different levels of granularity, from low level (e.g., people detection, motion estimation) to high level (e.g., emotional states, personality traits), and requires accurate methods operating at each level. This book describes some recent research efforts in the area of human behavior analysis, presenting methodologies for extracting behavioral cues at different levels. Special emphasis is given to recent approaches considering multimodal data to robustly extract behavioral information in real-world settings. Besides covering state-of-the-art research, the book also outlines some open challenges in the field as well as promising future research directions.

    Keywords

    Multimodal data; Human behavior analysis; Realistic conditions

    0.1 Analyzing human behavior in the wild from multimodal data

    Due to its importance in many applications, the automatic analysis of human behavior has been a popular research topic in the last decades. Understanding human behavior is relevant in many fields, such as assistive robotics, human–computer interaction, surveillance and security, to cite only a few.

    The automatic extraction of behavioral cues is an extremely challenging task involving several disciplines, ranging from machine learning and signal processing to computer vision and social psychology. Thanks to the recent progress in the area of Artificial Intelligence and deep learning, significant advances have been made in the last few years in the development of systems for human behavior analysis. For instance, technologies for speech recognition and machine translation have significantly improved and they are now able to work in a wide range of real-world settings. Similarly, several advances have been made in the robotics field, witnessed by the advent on the market of robots which accurately recognize and mimic human emotions. More surprisingly, in recent years technologies have appeared which are able to interpret people's behaviors even more precisely than human observers. For instance, computer vision researchers have developed systems which can estimate physiological signals (e.g. heart and respiration rate) by analyzing subtle skin color variations in face videos [18], or which can track the position of a moving person behind a wall from the shadow arising on the ground at the base of the wall's edge [6]. Despite this progress, many current technologies for human behavior analysis still have limited applicability and are not robust enough to operate in arbitrary conditions and real-world settings. In other words, the path towards automatically understanding human behaviors ‘in the wild’ is still to be discovered.

    It is a well-known fact that the automatic analysis of human behavior can benefit from harnessing multiple modalities. While earlier work on behavior understanding focused on an unimodal setting, typically considering only visual or audio data, more recent approaches leverage multimodal information. Investigating methods to process multimodal data is of utmost importance, as multiple modalities provide a more complete representation of human behavior. Moreover, data gathered with different sensors can be incomplete or corrupted by noise. In short, unimodal approaches fail to provide a robust and accurate representation of behavioral patterns, and smart multimodal fusion strategies are required. However, from the machine learning point of view, multimodal human behavior analysis is a challenging task, since learning the complex relationships across modalities is non-trivial.
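    As a toy illustration of why fusion helps (a hedged sketch, not a method from this book: the modality names, confidence scores and weighting scheme are invented for the example), confidence-weighted late fusion lets a reliable modality compensate for a noisy or missing one:

```python
# Toy confidence-weighted late fusion of per-modality class scores.
# Modalities, scores and weights are illustrative only.
from typing import Dict, Optional


def late_fusion(predictions: Dict[str, Optional[Dict[str, float]]],
                confidences: Dict[str, float]) -> Dict[str, float]:
    """Combine per-modality class scores; a missing modality (None) is skipped."""
    fused: Dict[str, float] = {}
    total_weight = 0.0
    for modality, scores in predictions.items():
        if scores is None:                       # e.g. occluded camera, muted microphone
            continue
        weight = confidences.get(modality, 0.0)
        total_weight += weight
        for label, score in scores.items():
            fused[label] = fused.get(label, 0.0) + weight * score
    return {label: value / total_weight for label, value in fused.items()} if total_weight else {}


# Video is unreliable here (low confidence), audio is trusted more.
print(late_fusion(
    predictions={"audio": {"speaking": 0.9, "silent": 0.1},
                 "video": {"speaking": 0.4, "silent": 0.6},
                 "accelerometer": None},          # sensor dropped out
    confidences={"audio": 0.8, "video": 0.2},
))
```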

    The analysis of human behavior from multimodal data has been encouraged in the last few years by the emergence on the market of novel devices, such as smartwatches and smartphones. These devices typically include several sensors (e.g. camera, microphone, accelerometer, Bluetooth, etc.), i.e., they are inherently multimodal. Additionally, the diffusion at the consumer level of systems such as drones and low-cost infrared or wearable cameras has opened the possibility of studying human behavior considering other types of data, such as images in an ego-vision setting, 2.5D data, or bird's-eye view videos. These technologies provide complementary information to traditional sensing modalities, such as distributed camera and microphone networks. For instance, when analyzing social scenes, wearable sensing devices can be exploited in association with data from traditional cameras to localize people and estimate their interactions [3]. Similarly, egocentric videos can be used together with images from surveillance cameras for the purpose of automatic activity recognition, or analyzed jointly with audio signals for robust people re-identification [7].

    Besides the widespread diffusion of novel devices, in the last decade the study of human behavior from multimodal data has also been encouraged by the emergence of new methodologies. In particular, research work from social signal processing [20,8] has enabled significant advances in the field. Studies have clearly stressed the importance of non-verbal cues such as gestures, gaze, and emotional patterns in human communication, and the need for methodologies that infer these cues by processing multimodal data. Similar studies have demonstrated the benefit of fusing auditory and visual data for lower-level tasks [4,5]. Furthermore, social signal processing has helped improve technologies for the automatic analysis of human behavior thanks to the integration of concepts from social psychology into machine learning algorithms (e.g. the concepts of proxemics and kinesics are now used in many approaches for automatically detecting groups in social scenes [11,19]).

    Significant advances in the field of understanding human behavior have also been achieved thanks to the (re-)discovery of deep neural networks. Deep learning has significantly improved the accuracy of many systems for extracting behavioral cues under real-world conditions. For instance, in computer vision deep models have been successfully applied to the tasks of activity recognition [22], gaze estimation [16], group analysis, etc. Some of the deep learning-based technologies described in this book have been deployed in real-world settings (e.g. the audio-visual systems described in Chapter 8 have been used by museum visitors). In addition, several research studies have proposed deep learning-based strategies for fusing multimodal data, outperforming previous approaches based on traditional machine learning models.

    The fast and broad progress in Artificial Intelligence has not only enabled great advances in the analysis of human behavior but has also opened new possibilities for generating realistic human-like behavioral data [23,14,17]. Notable examples relate to the synthesis of realistic-looking images of people, to the generation of human-sounding speech, as well as to the design of robots that emulate human emotions. Furthermore, the successes achieved with deep learning have also encouraged the research community to address new challenges. For instance, several recent studies have tackled the problem of activity forecasting and behavior prediction [1,21]. Other work has focused on rethinking the action–perception loop and devising end-to-end trainable architectures that directly predict actions from raw data [24,15]. In the area of human behavior analysis these studies are extremely relevant and could ultimately lead, for instance, to unified models for jointly analyzing human social behaviors and controlling intelligent vehicles and social robots.

    While the progress in the study of human behaviors has been considerable, recent work has also pointed out many limitations of current methodologies and systems. For instance, the adoption of deep learning in several applications has highlighted the need for large-scale datasets. Indeed, data availability can be a very limiting issue depending on the application at hand, for different reasons such as labeling cost, privacy, and synchronization problems. The research community is addressing this problem, and several datasets have been made available in the last few years for studying human behaviors in the wild. Notable attempts are, for instance, the efforts made by researchers involved in the Chalearn initiative [13] or in other dataset collection campaigns [10,12,9,2]. Besides the issues with data, several open challenges involve the design of algorithms for inferring behavioral cues. In particular, understanding human behavior requires approaches which operate at different levels of granularity and which are able to infer both low-level cues (e.g. people position or pose) and high-level information (e.g. group dynamics and social interactions, emotional patterns). However, devising methods which deal with tasks at different levels of granularity in isolation is largely suboptimal. Future research efforts should be devoted to addressing the problem of human behavior analysis in a more holistic manner.

    0.2 Scope of the book

    The main objective of this book is to present an overview of recent advances in the area of behavioral signal processing. A special focus is given to describing the strengths and weaknesses of current methods and technologies which (i) analyze human behaviors by exploiting different data modalities and (ii) are deployed in real-world scenarios. In other words, the two prominent characteristics of the book are the multimodality and the in the wild perspective. Regarding multimodality, the book presents state-of-the-art human behavior understanding methods which exploit information coming from different sensors. Audio and video being the most popular modalities used for analyzing human behavior and activities, they have a privileged role in the manuscript. However, the book also covers methodologies and applications where emerging modalities such as accelerometer or proximity data are exploited for behavior understanding. Regarding the in the wild aspect, the book aims to describe the current usage, limitations and challenges of systems combining multimodal data, signal processing and machine learning for the understanding of behavioral cues in real-world scenarios. Importantly, the book covers tasks at different levels of complexity, from low level (speaker detection, sensorimotor links, source separation), through middle level (conversational group detection, activity recognition) to high level (affect and emotion recognition).

    This book is intended to be a resource for experts and practitioners interested in the state of the art and future research challenges of multimodal behavioral analysis in the wild. It is suitable for researchers and graduate students in the fields of computer vision, audio processing, pattern recognition, multimedia analysis, machine learning, robotics, and social signal processing. The chapters of the book are organized according to three main directions, corresponding to three different application domains as illustrated in Fig. 0.1.

    Figure 0.1 Overview of the structure of the book chapters.

    The first series of chapters mostly deals with the problem of behavior understanding and multimodal fusion in the context of robotics and Human–Robot Interaction (HRI). In particular, Chapter 1 focuses on the development of dialog systems for robotic platforms and addresses two important challenges: how to move from closed-domain to open-domain dialogues, and how to create multimodal (audio-visual) dialog systems. The authors describe an approach to jointly tackle these two problems by proposing a Constructive Dialog Model and show how they handle topic shifts using Wikipedia as an external resource. Chapter 2 describes a robust methodology for audio-motor integration applied to robot hearing. In robotics, audio signal processing in the wild amounts to dealing with sounds recorded by a system that moves and whose actuators produce noise. This creates additional challenges in sound source localization, signal enhancement and recognition. But the specificity of such platforms also brings interesting opportunities: can information about the robot actuators' states be meaningfully integrated into the audio processing pipeline to improve performance and efficiency? While robot audition has grown into an established field, methods that explicitly use motor-state information as a complementary modality to audio are still scarce. This chapter proposes a unified view of this endeavor, referred to as audio-motor integration. A literature review and two learning-based methods for audio-motor integration in robot audition are presented, with application to single-microphone sound source localization and ego-noise reduction on real data. Chapter 3 reviews the literature related to multichannel audio source separation in real-life environments. The authors explore some of the major achievements in the field and discuss some of the remaining challenges. Several important issues, e.g. moving sources and/or microphones, varying numbers of sources and sensors, high reverberation levels, spatially diffuse sources, and synchronization, are extensively discussed. Many application scenarios, such as smart assistants, cellular phones, hearing aids and robots, are presented together with the most prominent associated methodologies. The chapter concludes with open challenges and promising future directions on the topic.

    A second series of chapters describes methodologies for fusing multimodal data collected with wearable technologies. In particular, Chapter 4 describes the development of novel wearable glasses aimed at assisting users with limited technology skills and disabilities. The glasses process audio-visual data and are equipped with technologies for visual object recognition to support users with low vision, as well as with algorithms for enhancing speech signals for people with hearing impairments. The chapter further illustrates the results of a user study conducted with people with disabilities in real-world settings. Chapter 5 also focuses on analyzing audio-visual data from wearable sensors and describes an approach for person re-identification where information from audio signals is exploited to complement image streams under challenging conditions (e.g., rapid changes in camera pose, self-occlusions, motion blur, etc.). Similarly, Chapter 6 considers video streams collected with wearable cameras. The authors address the problem of recognizing activities from visual lifelogs and, after outlining the main challenges of this task, they perform a detailed review of state-of-the-art methods and show the results of an extensive experimental comparison. An interesting application of visual lifelogs is described in Chapter 7. The authors present an approach to automatically analyzing images collected from wearable cameras in order to extract a nonredundant set of frames useful for memory stimulation in patients with neurodegenerative diseases (Alzheimer's disease, mild cognitive impairment, etc.). Chapters 8 and 9 describe how wearable technologies, alone or in combination with more traditional static and distributed sensors, can be used to analyze visitor behavior in museums. In particular, these chapters address the challenges of interpreting raw multimodal data for the purpose of visitor tracking and improving the tourist experience.

    Chapters 10–15 mostly describe recent methodologies and open problems in the analysis of social scenes. Specifically, Chapters 10 and 11 present approaches which exploit data from wearable sensors for the purpose of understanding social interactions. In particular, Chapter 10 addresses the problem of discovering conversational groups (more precisely, F-formations) in egocentric videos depicting social gatherings and presents an algorithm based on Structural SVM. Chapter 11 also focuses on the challenges of analyzing conversational scenes and, in particular, illustrates the limitations of the datasets currently publicly available for the automated analysis of human social behavior. The authors also describe the conceptual and practical issues inherent to the data collection process, with a specific focus on the multimodality and the ‘in the wild’ perspective. The problem of analyzing social interactions and detecting conversational groups is also addressed in Chapter 12. In particular, a methodology for recognizing F-formations derived from game theory is described but, differently from Chapter 10, the approach is tested on data from static surveillance cameras. Chapter 13 also addresses the problem of analyzing social interactions. The authors point out that understanding nonverbal behavioral cues (e.g., facial expressions, gaze, gestures, etc.) is important both from a human science perspective, as it helps to understand how people work, and from a technological point of view, because it allows one to design systems that can make sense of social and psychological phenomena. Chapter 14 focuses on crowd analysis, in contrast to the work presented in the previous chapters, which deals with social scenes involving a small number of people. The chapter describes the challenges of understanding crowd behaviors in realistic settings and provides an overview of state-of-the-art approaches for analyzing visual data and detecting motion patterns, tracking people, recognizing activities and spotting anomalous behaviors. A methodological contribution is presented in Chapter 15, where the problem of learning robust and invariant representations in visual recognition systems is considered. This issue is of utmost importance when deploying systems operating in the wild.

    The last series of chapters describes methodologies for detecting affective and emotional patterns in real-world settings. In particular, Chapter 16 focuses on the problem of visual affect recognition. The chapter addresses the challenges of bridging the ‘affective gap’ between visual features and semantic concepts. Following the Adjective Noun Pair (ANP) paradigm, i.e. considering mid-level representations based on adjective–noun pairs such as ‘scary face’ or ‘beautiful women’, the authors present an approach for sentiment and emotion prediction which operates by embedding ANP constructs in a latent space.

    Chapter 17 addresses the problem of video-based emotion recognition ‘in the wild’ and describes an approach for fusing audio-visual data. The method uses summarizing functionals of complementary visual descriptors in conjunction with audio features. The audio and visual data are fused within a least squares classifier framework. The authors report state-of-the-art results on the EmotiW Challenge. Chapter 18 also considers the problem of emotion recognition from audiovisual signals in real-world environments. This chapter highlights the differences between affect recognition in real-world and laboratory settings, provides an overview of state-of-the-art methodologies, and illustrates an audio-visual continuous emotion recognition system based on deep learning. Similarly, Chapter 19 focuses on affective facial computing with a special emphasis on the ‘in the wild’ aspect. Specifically, the authors consider the generalizability of facial computing technologies across various domains and provide a review of several previous studies on the topic. The outcome of their study is that the ability of current systems to generalize across domains is limited and that, in this context, transfer learning and domain adaptation methodologies are a precious resource. Finally, Chapter 20 discusses the problem of emotion recognition from behavioral data, with special emphasis on distinguishing between self-reported and perceived emotions. The authors further analyze how this aspect influences the design of systems for emotion recognition and outline recent advances and challenges in this research topic.

    0.3 Summary of important points

    This book aims to describe recent works in the area of human behavior analysis, with special emphasis on studies considering multimodal data. Besides providing an overview on state-of-the-art research in the field, the book highlights the main challenges associated with the automatic analysis of human behaviors in real-world scenarios, discussing the limitations of existing technologies. The chapters of the book describe a large variety of methodologies to extract behavioral cues from multimodal data and consider different applications. This clearly demonstrates that the topic addressed by the book can be of interest for a large set of researchers and graduate students working in different fields.

    References

    [1] Alexandre Alahi, Vignesh Ramanathan, Kratarth Goel, Alexandre Robicquet, Amir A. Sadeghian, Li Fei-Fei, Silvio Savarese, Learning to predict human behavior in crowded scenes, Group and Crowd Behavior for Computer Vision. 2017:183–207.

    [2] Xavier Alameda-Pineda, Jacopo Staiano, Ramanathan Subramanian, Ligia Batrinca, Elisa Ricci, Bruno Lepri, Oswald Lanz, Nicu Sebe, Salsa: a novel dataset for multimodal group behavior analysis, IEEE Trans. Pattern Anal. Mach. Intell. 2016;38(8):1707–1720.

    [3] Xavier Alameda-Pineda, Yan Yan, Elisa Ricci, Oswald Lanz, Nicu Sebe, Analyzing free-standing conversational groups: a multimodal approach, ACM International Conference on Multimedia. 2015:5–14.

    [4] Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, Radu Horaud, Exploiting the complementarity of audio and visual data in multi-speaker tracking, IEEE/CVF ICCV Workshop on Computer Vision for Audio-Visual Media. 2017.

    [5] Yutong Ban, Xiaofei Li, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud, Accounting for room acoustics in audio-visual multi-speaker tracking, IEEE International Conference on Acoustic, Speech and Signal Processing. 2018.

    [6] Katherine L. Bouman, Vickie Ye, Adam B. Yedidia, Frédo Durand, Gregory W. Wornell, Antonio Torralba, William T. Freeman, Turning corners into cameras: principles and methods, International Conference on Computer Vision. 2017.

    [7] Alessio Brutti, Andrea Cavallaro, Online cross-modal adaptation for audio-visual person identification with wearable cameras, IEEE Trans. Human-Mach. Syst. 2017;47(1):40–51.

    [8] Judee K. Burgoon, Nadia Magnenat-Thalmann, Maja Pantic, Alessandro Vinciarelli, Social Signal Processing. Cambridge University Press; 2017.

    [9] Laura Cabrera-Quiros, Andrew Demetriou, Ekin Gedik, Leander van der Meij, Hayley Hung, The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates, IEEE Trans. Affect. Comput. 2018 10.1109/TAFFC.2018.2848914.

    [10] Tatjana Chavdarova, Pierre Baqué, Stéphane Bouquet, Andrii Maksai, Cijo Jose, Louis Lettry, Pascal Fua, Luc Van Gool, François Fleuret, The wildtrack multi-camera person dataset, arXiv preprint arXiv:1707.09299; 2017.

    [11] Marco Cristani, R. Raghavendra, Alessio Del Bue, Vittorio Murino, Human behavior analysis in video surveillance: a social signal processing perspective, Neurocomputing 2013;100:86–97.

    [12] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al., Scaling egocentric vision: the epic-kitchens dataset, arXiv preprint arXiv:1804.02748; 2018.

    [13] Sergio Escalera, Xavier Baró, Hugo Jair Escalante, Isabelle Guyon, Chalearn looking at people: a review of events and resources, International Joint Conference on Neural Networks. 2017:1594–1601.

    [14] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, Alexandre Alahi, Social GAN: socially acceptable trajectories with generative adversarial networks, IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018.

    [15] Guan-Horng Liu, Avinash Siravuru, Sai Prabhakar, Manuela Veloso, George Kantor, Learning end-to-end multimodal sensor policies for autonomous navigation, arXiv preprint arXiv:1705.10422; 2017.

    [16] Adria Recasens, Aditya Khosla, Carl Vondrick, Antonio Torralba, Where are they looking? Advances in Neural Information Processing Systems. 2015:199–207.

    [17] Aliaksandr Siarohin, Enver Sangineto, Stéphane Lathuilière, Nicu Sebe, Deformable GANs for pose-based human image generation, IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018.

    [18] Sergey Tulyakov, Xavier Alameda-Pineda, Elisa Ricci, Lijun Yin, Jeffrey F. Cohn, Nicu Sebe, Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions, IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2016:2396–2404.

    [19] Jagannadan Varadarajan, Ramanathan Subramanian, Samuel Rota Bulò, Narendra Ahuja, Oswald Lanz, Elisa Ricci, Joint estimation of human pose and conversational groups from social scenes, Int. J. Comput. Vis. 2018;126(2–4):410–429.

    [20] Alessandro Vinciarelli, Maja Pantic, Hervé Bourlard, Social signal processing: survey of an emerging domain, Image Vis. Comput. 2009;27(12):1743–1759.

    [21] Carl Vondrick, Deniz Oktay, Hamed Pirsiavash, Antonio Torralba, Predicting motivations of actions by leveraging text, IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2016:2997–3005.

    [22] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, Lisha Hu, Deep learning for sensor-based activity recognition: a survey, Pattern Recognit. Lett. 2018.

    [23] Wei Wang, Xavier Alameda-Pineda, Dan Xu, Pascal Fua, Elisa Ricci, Nicu Sebe, Every smile is unique: landmark-guided diverse smile generation, IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:7083–7092.

    [24] Huazhe Xu, Yang Gao, Fisher Yu, Trevor Darrell, End-to-end learning of driving models from large-scale video datasets, IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2017.

    Chapter 1

    Multimodal open-domain conversations with robotic platforms

    Kristiina Jokinen⁎; Graham Wilcock†

    ⁎AIST/AIRC, Tokyo, Japan

    †CDM Interact, Helsinki, Finland

    Abstract

    The chapter discusses how to move from closed-domain dialogs to open-domain dialogs, and from speech-based dialogs to multimodal dialogs with speech, gestures, and gaze, using robot agents. We briefly describe the Constructive Dialog Model, the foundation for our work. Management of topic shifts is one of the challenges for open-domain dialogs, and we describe how Wikipedia can be used both for handling topic shifts and as an open-domain knowledge source. Multimodal issues are illustrated by our multimodal WikiTalk open-domain robot dialog system. Two future research directions are discussed: the use of domain ontologies in dialog systems and the need to integrate robots with the Internet of Things.

    Keywords

    Multimodal communication; Constructive dialog model; Human–robot interaction; Open-domain dialogs

    Chapter Outline

    1.1  Introduction

    1.1.1  Constructive Dialog Model

    1.2  Open-domain dialogs

    1.2.1  Topic shifts and topic trees

    1.2.2  Dialogs using Wikipedia

    1.3  Multimodal dialogs

    1.3.1  Multimodal WikiTalk for robots

    1.3.2  Multimodal topic modeling

    1.4  Future directions

    1.4.1  Dialogs using domain ontologies

    1.4.2  IoT and an integrated robot architecture

    1.5  Conclusion

    References

    1.1 Introduction

    From a historical point of view, the development of natural language conversational systems has accelerated in recent years due to advances in computational facilities and multimodal dialog modeling, availability of big data, statistical modeling and deep learning, and increased demand in commercial applications. In a somewhat simplified manner, we can say that the capability of conversational systems has improved in roughly 20-year time-spans if seen from the viewpoint of technological advancements: from ELIZA's imitation of human-like properties in the early 1960s, via systems that understand spoken natural language and various multimodal acts in limited domains developed in the 1980s, to interactive systems that are part of everyday environments in the 21st century. Embodied conversational agents, chatbots, Siri, Amazon Alexa, Google Home, etc. are examples of the multitude of interactive systems that aim to provide natural language capabilities for question answering and to search for useful information in the cloud.

    The rapid development of robot technology has had a huge impact on interaction research and, in particular, on developing social robotics, i.e. human–robot applications where the robot can communicate with a user in natural language and is able to observe and understand the user's needs and emotions. This enables a novel type of interaction where the robot is not just a tool to do things, but an agent to communicate with: social robots can interact with human users in natural language, and support companionship and peer-type assistance which feature information-providing as well as chatting and sensitivity to social aspects of interaction. Co-located acting and free observations of the partner are both beneficial and challenging for interaction modeling. Interaction becomes richer and more natural, but also more complicated, since the system must learn the various social signals and construct a shared context for the interaction (cf. [15]).

    Social robotics emphasizes the robot's communication skills besides its autonomous decision-making and its ability to move around in the environment. Social robots show more human-like interaction and try to act in a proactive manner so as to support human interest and activity. Consequently, social robotics has had a huge impact on interaction technology. Communication is simultaneously visual, verbal, and vocal, i.e. humans not only utter words, but use various vocalizations (laughs, coughs), head, gaze, hands, and the whole body to convey messages. In order to understand human behavior and communicative needs, the robot should observe the user's multimodal signals and be able to generate reasonable behavior patterns in interactive situations. The main hypothesis is that the more engaging the interaction is in terms of communicative competence, the better results the interaction produces, whether the task that the user is involved in with the robot is friendly chatting or some more structured task.

    The chapter is structured as follows. The next section describes the Constructive Dialog Model which forms the foundation for our work. Section 1.2 discusses issues in moving from closed-domain dialogs to open-domain dialogs, including how to manage topic shifts and how to use Wikipedia as a knowledge source. Section 1.3 addresses multimodal interaction with the Nao robot by speech, gesturing and face-tracking, and multimodal aspects of topic modeling. Section 1.4 briefly presents two future research directions, the use of domain ontologies in dialog systems and the need to integrate robots with the Internet of Things. Section 1.5 presents conclusions.

    1.1.1 Constructive Dialog Model

    Conversational interactions are cooperative activities through which the interlocutors build common ground (Clark and Schaefer [8]). Cooperation indicates the interlocutors' basic willingness to engage in the conversation, and manifests itself in smooth turn-taking and coherent replies. The agents react to the situation according to their beliefs, intentions and interpretation of the situation, and they use multimodal signals to indicate how the basic enablements of communication are fulfilled.

    In Fig. 1.1, interaction in the Constructive Dialog Model (CDM) [15] is seen as a cycle which starts with the participants being in contact, observing the partner's intent to communicate, interpreting the partner's communicative content, and producing their own reaction to the message in an appropriate manner. Fig. 1.1 shows the communication cycle with the basic enablements of communication, which concern Contact, Perception, Understanding, and Reaction (Allwood [1], Jokinen [15]).

    Figure 1.1 The communication cycle in the Constructive Dialog Model.

    Contact refers to the participants' mutual awareness of their intention to communicate, i.e. being close enough to be able to communicate or having a means to communicate such as a phone or skype if not in a face-to-face situation. Perception relates to the participants' perception of communicative signals as a message with an intent. Understanding concerns the participants' cognitive processes to interpret the message in the given context. Reaction is the speakers' observable behavior which manifests their reaction to the new changed situation in which the agents find themselves.

    In the CDM system architecture in Fig. 1.2, signal detection and signal analysis modules implement Contact and Perception, respectively, for speech, gesture, and gaze recognition, and these components are responsible for interpreting the user awareness. Understanding is implemented by the decision-making and related modules, while Reaction corresponds to the production and coordination of utterances and motoric actions, including internal updates. Together these two are responsible for the system's engagement with the user. The dialog management is based on dialog states (also called mental states) which are representations of the situation the robot is in and the situation it believes the user is in.

    Figure 1.2 An implementation of the CDM model in a system architecture.
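    To make the mapping from enablements to modules concrete, the following schematic sketch (module and field names are hypothetical simplifications, not the actual system of Fig. 1.2) organizes one pass of the cycle as detect, analyze, understand, and react, operating on a shared dialog state:

```python
# Schematic sketch of a CDM-style processing cycle (Contact, Perception,
# Understanding, Reaction). Names and logic are illustrative, not the
# architecture of Fig. 1.2 itself.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialogState:
    """The robot's representation of the situation it and the user are in."""
    user_present: bool = False
    last_user_utterance: str = ""
    topic: Optional[str] = None
    history: List[str] = field(default_factory=list)


class CDMCycle:
    def run_once(self, sensor_frame: dict, state: DialogState) -> str:
        signals = self.detect(sensor_frame)        # Contact: is someone there to talk to?
        percept = self.analyze(signals)            # Perception: speech, gesture, gaze signals
        intent = self.understand(percept, state)   # Understanding: interpret in context
        return self.react(intent, state)           # Reaction: produce speech/motor actions

    def detect(self, sensor_frame: dict) -> dict:
        return {"face": sensor_frame.get("face"), "audio": sensor_frame.get("audio")}

    def analyze(self, signals: dict) -> dict:
        return {"speech": signals.get("audio") or "",
                "user_present": signals.get("face") is not None}

    def understand(self, percept: dict, state: DialogState) -> str:
        state.user_present = percept["user_present"]
        state.last_user_utterance = percept["speech"]
        return "greet" if state.user_present and not state.history else "continue_topic"

    def react(self, intent: str, state: DialogState) -> str:
        utterance = ("Hello! Shall we talk about something?" if intent == "greet"
                     else f"Shall I continue about {state.topic or 'the current topic'}?")
        state.history.append(utterance)            # internal update of the dialog state
        return utterance


if __name__ == "__main__":
    state = DialogState()
    print(CDMCycle().run_once({"face": "user_1", "audio": "hi robot"}, state))
```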

    Many neurocognitive studies show how activation in the brain is triggered by the mere appearance of a human in the vicinity of a person, while attention is directed to a human face (Levitski et al. [22]). The robot agent obtains information from the environment via its sensors, and the dialog component integrates this information into the system knowledge base through its recognition and decision-making processes. The perception of the partner concerns recognition of the signals as having some communicative meaning: the face belongs to a particular person, the sounds belong to a particular language, and gesturing has communicative content.

    Interpretation of the signals concerns their further processing to form a meaningful semantic representation in the given context. The new information entered into the system will trigger a reaction, i.e. cognitive processes which evaluate the new information with respect to the agent's own goals and the decision-making process which results in carrying out an action that in turn triggers a similar analysis and generation process in the partner. If the speaker is repeatedly exposed to a certain kind of communicative situation and if the speaker's communicative action results in a successful goal achievement, the same action will be used again, to maximize benefits in the future.

    To construct common ground, the interlocutors thus pay attention to signals that indicate the partner's reaction to the conveyed message, their emotional state, and the new information in the partner's speech. Non-verbal signals such as pauses, intonation, nods, smiles, frowns, eye-gaze, gesturing etc. are effectively used to signal the speaker's understanding and emotions (Feldman and Rimé [12]). Studies on embodied conversational agents have widely focused on various aspects of interaction, multimodality, and culturally competent communication (see e.g. André and Pelachaud [4], Jokinen et al. [16,17]). For instance, in human–human interactions, gestures and head movements complement language expressions and enable the interlocutors to coordinate the dialog in a tacit manner. Gesturing is synchronized with speech, and besides the pointing gestures and iconic gestures that refer to objects and describe events, gesturing also provides rhythmic accompaniment to speech (co-gesturing) which contributes to the fluency of expression and construction of shared understanding.

    A related question is the robot's understanding of the relation between language and the physical environment, i.e. the embodiment of linguistic knowledge via action and interaction. The connection between linguistic expressions and objects in the environment is called grounding. In dialog modeling, the term grounding is usually used to refer to the interlocutors' actions that enable them to build mutual understanding of the referential elements. Grounding in interactions can be studied with respect to the notion of affordance (Jokinen [15]): the interlocutors' actions (speech and gesturing) should readily support a natural and smooth way of communication. Interactions between humans and robots as well as between humans and intelligent environments should enable easy recognition of various communicatively important objects, and the different objects must be distinguished from each other so as to be correctly referred to.

    The CDM framework is applied in the WikiTalk open-domain robot dialog system (Jokinen and Wilcock [21]) where both the human and the robot can initiate topics and ask questions on a wide variety of topics. It is also applied in human–robot interaction such as newspaper-reading or story-telling in elder-care and educational activities. It aims at acquiring a good level of knowledge about the user and his/her context and can thus enable an open-domain conversation with the user, presenting useful and interesting information. As the robot can observe user behavior, it can infer a user's emotion and interest levels, and tailor its presentation accordingly.

    1.2 Open-domain dialogs

    In traditional closed-domain dialog systems, such as flight reservation systems, the system asks questions in order to achieve a specified dialog goal. Finite state machines can be used for this kind of closed-domain form-filling dialog: the system asks questions, which are predefined for the specific domain, in order to achieve the dialog goal by filling in the required fields in the form. It is easy to change the information within the domain database, for example about flights and destinations, but it is very difficult to change to a different domain because the questions are specific to the domain.
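    As a minimal illustration of this kind of closed-domain, form-filling behavior (the slot names and prompts below are invented for the example, not taken from a particular system), the dialog manager simply walks through a fixed set of domain-specific questions until the form is complete:

```python
# Minimal sketch of a closed-domain, form-filling dialog manager.
# Slot names and prompts are hypothetical; a real flight-reservation
# system would add validation, confirmation and error-recovery states.

FORM_SLOTS = ["origin", "destination", "date"]

PROMPTS = {
    "origin": "Which city are you flying from?",
    "destination": "Which city are you flying to?",
    "date": "On which date would you like to travel?",
}


def form_filling_dialog() -> dict:
    """Ask domain-specific questions until every slot in the form is filled."""
    form = {slot: None for slot in FORM_SLOTS}
    for slot in FORM_SLOTS:            # fixed, domain-specific question order
        while form[slot] is None:
            answer = input(PROMPTS[slot] + " ").strip()
            if answer:                 # a real system would validate the value here
                form[slot] = answer
    print(f"Booking a flight from {form['origin']} to {form['destination']} "
          f"on {form['date']}.")
    return form


if __name__ == "__main__":
    form_filling_dialog()
```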

    In order to advance from closed-domain to open-domain dialogs, WikiTalk [26] uses Wikipedia as its source of world knowledge. By exploiting ready-made paragraphs and sentences from Wikipedia, the system enables a robot to talk about thousands of different topics (there are 5 million articles in English Wikipedia). WikiTalk is open-domain in the sense that the currently talked-about topic can be any topic that Wikipedia has an article about, and the user can switch topics at any time and as often as desired. In an open-domain system it is extremely important to keep track of the current topic and to have smooth mechanisms for changing to new topics.
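    To make the idea concrete, the following sketch (an illustrative stand-in, not the actual WikiTalk implementation) retrieves the plain-text introduction of an arbitrary article through the public MediaWiki API and splits it into sentences that a robot could read out one at a time:

```python
# Sketch of Wikipedia-backed content retrieval in the spirit of WikiTalk.
# Illustrative only: WikiTalk itself is not implemented this way.
import re
import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_intro(topic: str) -> str:
    """Return the plain-text introduction of the Wikipedia article on `topic`."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,          # introduction section only
        "explaintext": 1,      # strip wiki markup
        "redirects": 1,
        "titles": topic,
        "format": "json",
    }
    pages = requests.get(API_URL, params=params, timeout=10).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")


def sentences_for_speech(text: str):
    """Naively split the extract into sentences the robot can utter one by one."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


if __name__ == "__main__":
    for sentence in sentences_for_speech(fetch_intro("Robotics"))[:3]:
        print(sentence)
```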

    1.2.1 Topic shifts and topic trees

    An important feature that enables an interactive system to manage dialogs in a natural manner is its ability to handle smooth topic shifts, i.e. to be able to provide a relevant continuation in the current dialog state. The underlying problem is that knowing what is relevant depends on the overall organization of knowledge.

    The organization of knowledge into related topics has often been done with the help of topic trees. Originally focus trees were proposed by McCoy and Cheng [23] to trace foci in natural language generation systems. The branches of the tree describe what sort of topic shifts are cognitively easy to process and can be expected to occur in dialogs: random jumps from one branch to another are not very likely to occur, and if they do, they should be appropriately marked. McCoy and Cheng [23] dealt with different types of focusing phenomena by referring to a model of the conceptual structure of the domain of discourse. They introduced the notion of focus tree and argued that the tree structure is more flexible in managing focus shifts than a stack: instead of pushing and popping foci in a particular order into and from the stack, the tree allows traversal of the branches in a different order, and the coherence of the text can be determined on the basis of the distance of the focus nodes in the tree.

    The focus tree is a subgraph of the world knowledge, built in the course of the discourse on the basis of the utterances that have occurred so far. The tree both constrains and enables prediction of what is likely to be talked about next, and thus provides a top-down approach to dialog coherence. The topic (focus) is a means to describe thematically coherent discourse structure, and its use has been mainly supported by arguments regarding anaphora resolution and processing effort. Focus shifting rules are expressed in terms of the types of relationships which occur in the domain. In language generation, they provide information about whether or not a topic shift is easy to process (and, similarly, whether or not the hearer will expect some kind of marker), and in language analysis they help to decide what sort of topic shifts are likely to occur. Jokinen, Tanaka and Yokoo [19] applied the idea of the focus tree to spoken dialog processing. They made the distinction between topical and non-topical informational units, i.e. what the utterance is about vs. what is in the background, and between new and old information in the dialog context.
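    As a toy rendering of this idea (the tree content and the threshold below are invented for the example; this is not the formulation of [23] or [19]), the coherence of a proposed topic shift can be approximated by the distance between the current and the proposed focus nodes in the tree:

```python
# Toy topic-tree sketch: the coherence of a topic shift is approximated by
# the distance between focus nodes in the tree. Tree content is illustrative.

TOPIC_TREE = {                      # child -> parent
    "jazz": "music",
    "opera": "music",
    "music": "arts",
    "painting": "arts",
    "arts": "root",
    "football": "sport",
    "sport": "root",
}


def path_to_root(node: str) -> list:
    path = [node]
    while node in TOPIC_TREE:
        node = TOPIC_TREE[node]
        path.append(node)
    return path


def tree_distance(a: str, b: str) -> int:
    """Number of edges between two topic nodes (smaller = more coherent shift)."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(n for n in pa if n in pb)          # lowest common ancestor
    return pa.index(common) + pb.index(common)


def needs_marker(current: str, proposed: str, threshold: int = 2) -> bool:
    """A distant jump should be explicitly marked ('by the way, ...')."""
    return tree_distance(current, proposed) > threshold


print(tree_distance("jazz", "opera"))      # 2: sibling topics, smooth shift
print(needs_marker("jazz", "football"))    # True: cross-branch jump, mark it
```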

    Grosz, Weinstein and Joshi [14] distinguished between global and local coherence, as well as between global focus and centering, respectively. The former refers to the ways in which larger segments of discourse relate to each other, and accordingly, global focus refers to a set of entities that are relevant to the overall discourse. The latter deals with individual sentences and their combinations into larger discourse segments, and accordingly, centering refers to a more local focusing process which identifies a single entity as the most central one in an individual sentence. Each sentence can thus be associated with a single backward-looking center which encodes the notion of global focusing and a set of forward-looking centers which encodes the notion of centering.
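    As a simplified concrete illustration of these notions (a sketch of the commonly used definition, not the full centering framework of Grosz, Weinstein and Joshi [14]), the backward-looking center of an utterance can be computed as the highest-ranked forward-looking center of the previous utterance that is realized in the current one:

```python
# Simplified illustration of centering: the backward-looking center Cb of an
# utterance is taken to be the highest-ranked entity in the previous
# utterance's forward-looking centers Cf that also appears in the current one.
from typing import List, Optional


def backward_looking_center(prev_cf: List[str],
                            current_entities: List[str]) -> Optional[str]:
    current = set(current_entities)
    for entity in prev_cf:            # prev_cf is ranked by salience (subject first, etc.)
        if entity in current:
            return entity
    return None


# "John saw Mary."   -> Cf = [John, Mary]
# "She waved at him." -> realizes both; Cb = John (highest-ranked of previous Cf)
print(backward_looking_center(["John", "Mary"], ["Mary", "John"]))   # John
```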

    The organization of knowledge has always been one of the big questions, but we can now look for help with this question from the internet. In fact we can assume that world knowledge is somehow stored in the internet and we wish to take advantage of this. Previously, topic trees were hand-coded which is time-consuming and subjective. Automatic clustering programs were also used but were not entirely satisfactory. Our approach to topic trees exploits the organization of domain knowledge in terms of topic types found in the web, and more specifically in Wikipedia.

    We use topic information in predicting the likely content of the next utterance, and thus we are more interested in the topic types that describe the information conveyed by utterances than in the actual topic entity. Consequently, instead of tracing salient entities in the dialog and providing heuristics for different shifts of attention, we seek a formalization of the information structure of utterances in terms of the new information that is exchanged. Wikipedia provides an extensive, freely available, open-domain and constantly growing knowledge source. We therefore use Wikipedia to produce robot contributions in open-domain dialogs.
