Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset

Zhejiang University
*Indicates Equal Contribution

Abstract

Understanding human emotions is fundamental to enhancing human-computer interaction, especially for embodied agents that mimic human behavior. Traditional emotion analysis typically takes a third-person perspective, limiting the ability of agents to interact naturally and empathetically. To address this gap, this paper presents E3 (Exploring Embodied Emotion), the first large-scale first-person-view video dataset for embodied emotion analysis. E3 contains more than 70 hours of video, capturing 8 emotion types across diverse scenarios and languages. The videos are recorded by individuals in their daily lives, capturing a wide range of real-world emotions conveyed through visual, acoustic, and textual modalities. Building on this dataset, we define 4 core benchmark tasks: emotion recognition, emotion classification, emotion localization, and emotion reasoning, supported by more than 80k manually crafted annotations, providing a comprehensive resource for training and evaluating emotion analysis models. We further present Emotion-LlaMa, which complements the visual modality with the acoustic modality to enhance emotion understanding in first-person videos. Comparative experiments against a broad set of baselines demonstrate the superiority of Emotion-LlaMa and establish a new benchmark for embodied emotion analysis. We expect E3 to promote advances in multimodal understanding, robotics, and augmented reality, and to provide a solid foundation for developing more empathetic and context-aware embodied agents.

Task

Examples in E3. E3 collects a large number of first-person-view videos in which the camera wearer (denoted as C) actively engages in the video activities. Each video in the dataset is manually annotated with fine-grained labels to support four embodied emotion analysis tasks.
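To make the four benchmark tasks concrete, the sketch below shows one plausible shape for a per-video annotation record and a simple localization query. The field names (`video_id`, `emotion`, `start_sec`, `end_sec`, `reasoning`) and the `E3Annotation` class are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of an E3-style annotation record.
# Field names are illustrative assumptions, not the released schema.
from dataclasses import dataclass

@dataclass
class E3Annotation:
    video_id: str
    emotion: str     # one of the 8 emotion categories (classification)
    start_sec: float # emotion localization: segment start time
    end_sec: float   # emotion localization: segment end time
    reasoning: str   # free-text answer for the emotion reasoning task

def segments_with_emotion(annotations, emotion):
    """Return (start, end) segments labeled with the given emotion."""
    return [(a.start_sec, a.end_sec)
            for a in annotations if a.emotion == emotion]

anns = [
    E3Annotation("v001", "joy", 3.0, 8.5,
                 "C laughs while greeting a friend."),
    E3Annotation("v001", "anger", 20.0, 24.0,
                 "C raises their voice in traffic."),
]
print(segments_with_emotion(anns, "joy"))  # [(3.0, 8.5)]
```

A real loader would read such records from the released annotation files; the point here is only how localization, classification, and reasoning labels can coexist on one segment-level record.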

Dataset Statistics

(a) Distribution of video duration, answer length, and scenes.
(b) Distribution of emotion timestamps and topics.
(c) Distribution of emotion categories.

Emotions and Topics Relationship Graph