Multi-Modal Hierarchical Empathetic Framework for Social Robots With Affective Body Control

Abstract

Social robots need to understand human emotions and provide affective and behavioral responses during human-robot interactions. However, current social robots lack empathy capabilities. In this work, we propose a novel Multi-modal Hierarchical Empathetic (MHE) framework for generating empathetic responses for social robots. MHE is composed of a multi-modal fusion and emotion recognition module, an empathetic dialogue generation module, and an expression generation module. By fusing sensor signals from different modalities, the robot can recognize human emotions and generate affective responses. Multiple experiments are conducted on a real robot, Pepper, to evaluate the proposed framework. In a blind test, participants are asked to discriminate between MHE-generated text and human responses without knowing the source of each response, and most participants agree that MHE can effectively generate human-like and empathetic responses. To better evaluate the similarity between human-robot and human-human interactions, a period eye movement map (PEM) built from data captured by an eye tracker is proposed. Comparing different PEMs shows that MHE improves human-robot interaction.

Method

This work proposes a Multi-modal Hierarchical Empathetic (MHE) framework for general social robots equipped with multi-modal sensors and body controllers. Firstly, it proposes a multi-modal empathetic dialogue generation model with multi-task training. An embedding attention network is used to decode multi-modal representations and textual features, enhancing multi-modal empathy. Secondly, for generating affective body movements, joint positions and speeds are controlled based on emotion recognition results and sentiment intensity, ensuring consistency with empathetic dialogue generation. Thirdly, to provide a more objective evaluation, a Period Eye Movement Map (PEM) is formulated. Eye movement data captured by an eye tracker is presented in a distribution map, and the similarity between human-robot interaction and human-to-human interaction is evaluated by comparing different PEMs.
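The PEM idea can be illustrated with a minimal sketch. It assumes that gaze samples are given as normalized (x, y) screen coordinates in [0, 1] and that the distribution map is a normalized 2D histogram; the grid size and the function name `period_eye_movement_map` are hypothetical and not taken from the paper.

```python
import numpy as np


def period_eye_movement_map(gaze_xy, bins=(32, 32)):
    """Accumulate the gaze samples of one interaction period into a 2D distribution map.

    gaze_xy: array-like of shape (n_samples, 2) with normalized (x, y) in [0, 1].
    Returns an array of shape `bins` (rows = y, columns = x) that sums to 1
    when at least one sample falls inside the grid.
    """
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    hist, _, _ = np.histogram2d(
        gaze_xy[:, 1],  # y coordinate selects the row
        gaze_xy[:, 0],  # x coordinate selects the column
        bins=bins,
        range=[[0.0, 1.0], [0.0, 1.0]],
    )
    total = hist.sum()
    return hist / total if total > 0 else hist


# Illustrative samples only: two fixations near the top-left, one near the bottom-right.
gaze = [[0.1, 0.1], [0.1, 0.1], [0.9, 0.9]]
pem = period_eye_movement_map(gaze)
```

Normalizing the map makes periods of different lengths comparable, which is one plausible reason to treat the PEM as a distribution rather than a raw count.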

Fig. 1 The structure of the proposed MHE framework, which is composed of three modules: (i) multi-modal fusion and emotion recognition, (ii) empathetic dialogue generation, and (iii) expression generation. Multi-modal signals are processed by module (i) to obtain the emotion prediction and the multi-modal representation. Module (ii) fuses the word embedding and the multi-modal representation by embedding attention, and the attention output is decoded into the empathetic dialogue. Module (iii) integrates body movements and speech synthesis results to control the robot’s empathetic expression.
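As a rough illustration of how module (iii) might turn an emotion label and a sentiment intensity into joint positions and speeds, the sketch below blends from a neutral pose toward an emotion-specific pose and scales the motion speed with intensity. The pose table, the angle values, the linear speed mapping, and the names `EMOTION_POSES`, `JointCommand`, and `plan_affective_motion` are all assumptions made for illustration; the paper's actual control rule is not given here.

```python
from dataclasses import dataclass

# Hypothetical pose table. The joint names follow Pepper's naming, but the
# angles (radians) are illustrative values, not taken from the paper.
EMOTION_POSES = {
    "neutral": {"HeadPitch": 0.0, "LShoulderPitch": 1.4, "RShoulderPitch": 1.4},
    "happy": {"HeadPitch": -0.2, "LShoulderPitch": 0.6, "RShoulderPitch": 0.6},
    "sad": {"HeadPitch": 0.3, "LShoulderPitch": 1.5, "RShoulderPitch": 1.5},
}


@dataclass
class JointCommand:
    name: str
    angle: float           # target joint angle in radians
    speed_fraction: float  # fraction of the joint's maximum speed


def plan_affective_motion(emotion, intensity, min_speed=0.1, max_speed=0.5):
    """Blend from the neutral pose toward the emotion pose by `intensity` in [0, 1]."""
    intensity = min(max(intensity, 0.0), 1.0)
    neutral = EMOTION_POSES["neutral"]
    target = EMOTION_POSES.get(emotion, neutral)  # unknown labels fall back to neutral
    # Assumed mapping: stronger sentiment gives faster motion.
    speed = min_speed + intensity * (max_speed - min_speed)
    return [
        JointCommand(name, neutral[name] + intensity * (angle - neutral[name]), speed)
        for name, angle in target.items()
    ]


commands = plan_affective_motion("happy", intensity=0.5)
```

Driving both the pose and the speed from the same emotion prediction is one simple way to keep the body movement consistent with the generated dialogue.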
Fig. 2 The cross-modal fusion calculation process in our framework. The features of different modalities are first paired and fused by using cross-modal attention.
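The pairing-and-fusion step can be sketched with generic scaled dot-product cross-modal attention, where one modality supplies the queries and the other supplies the keys and values. This is a minimal sketch under stated assumptions: the three modalities (text, audio, vision), the feature sizes, the use of directed pairs, and the random matrices standing in for learned projections are all hypothetical, and the framework's actual layers may differ.

```python
from itertools import permutations

import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(target, source, w_q, w_k, w_v):
    """Let each time step of `target` attend over all time steps of `source`.

    target: (t_len, d_model), source: (s_len, d_model).
    Returns the fused features (t_len, d_attn) and attention weights (t_len, s_len).
    """
    q = target @ w_q
    k = source @ w_k
    v = source @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d_model, d_attn = 16, 8
# Assumed modalities and sequence lengths, for illustration only.
features = {
    "text": rng.normal(size=(5, d_model)),
    "audio": rng.normal(size=(7, d_model)),
    "vision": rng.normal(size=(4, d_model)),
}

fused, attention = {}, {}
for tgt, src in permutations(features, 2):  # every directed pair of modalities
    # Random projections stand in for weights that would be learned in training.
    w_q, w_k, w_v = (
        rng.normal(scale=d_model ** -0.5, size=(d_model, d_attn)) for _ in range(3)
    )
    fused[(tgt, src)], attention[(tgt, src)] = cross_modal_attention(
        features[tgt], features[src], w_q, w_k, w_v
    )
```

Each fused output keeps the sequence length of its target modality, so the outputs for a given target can later be concatenated or summed along the feature axis.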

Results

Response scores in different emotion cases. H denotes responses generated by humans, H w/o E denotes responses generated by humans without empathy, and MHE w/o E denotes responses generated by MHE without empathy (the multi-modal emotion recognition module is removed from MHE). The results for each emotion case are calculated from 778 samples, and the mean value and standard deviation are marked at the center of each histogram.
The structural similarity (SSIM) is used to evaluate the similarity between different attention distributions. The average SSIM value and the corresponding standard deviation are shown on the left. Robot with MHE represents the robot with the complete MHE model in sub-task 1, and robot w/o represents the robot without the MHE model in sub-task 2.
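To make the comparison concrete, the sketch below computes a simplified, single-window SSIM between two attention maps. Standard SSIM averages the same statistic over local sliding windows, so this global variant only illustrates the formula and is not the exact procedure behind the reported values. The function name `global_ssim`, the default data range, and the random example maps are assumptions; the constants k1 = 0.01 and k2 = 0.03 are the commonly used SSIM defaults.

```python
import numpy as np


def global_ssim(a, b, data_range=None, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized 2D maps.

    Simplification: statistics are taken over the whole map instead of
    being averaged over local windows as in standard SSIM.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("maps must have the same shape")
    if data_range is None:
        # Assumed default: the value span covered by the two maps together.
        data_range = max(a.max(), b.max()) - min(a.min(), b.min())
    if data_range == 0:
        return 1.0  # both maps are the same constant
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


# Illustrative maps only: random values standing in for two attention distributions.
rng = np.random.default_rng(0)
map_human = rng.random((32, 32))
map_robot = rng.random((32, 32))
score_same = global_ssim(map_human, map_human)
score_diff = global_ssim(map_human, map_robot)
```

A score close to 1 indicates that the two maps have similar mean, spread, and spatial structure, which is the sense in which a robot's map can be said to resemble the human-to-human one.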