SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

SignAvatars is the first large-scale 3D sign language holistic motion dataset with mesh annotations, which comprises 8.34M precise 3D whole-body SMPL-X annotations, covering 70K motion sequences. The corresponding MANO hand version is also provided.

Figure 1: Overview of SignAvatars, the first public large-scale multi-prompt 3D sign language holistic motion dataset. (upper row) We introduce a generic method to automatically annotate a large corpus of video data. (lower row) We propose the first 3D SLP benchmark to generate plausible 3D holistic mesh motion.

Videos

Abstract

In this paper, we present SignAvatars, the first large-scale multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for hearing-impaired individuals. While the volume of research on digital communication has grown exponentially, the majority of existing communication technologies cater primarily to spoken or written languages rather than SL, the essential communication method for hearing-impaired communities. Existing SL datasets, dictionaries, and sign language production (SLP) methods are typically limited to 2D, because annotating 3D models and avatars for SL is usually an entirely manual, labor-intensive process conducted by SL experts, and it often results in unnatural avatars. In response to these challenges, we compile and curate the SignAvatars dataset, which comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs, with multiple prompts including HamNoSys, spoken language, and words. To yield 3D holistic annotations, including meshes and biomechanically-valid poses of body, hands, and face, as well as 2D and 3D keypoints, we introduce an automated annotation pipeline operating on our large corpus of SL videos. SignAvatars facilitates various tasks such as 3D sign language recognition (SLR) and the novel 3D SL production (SLP) from diverse inputs like text scripts, individual words, and HamNoSys notation. To evaluate the potential of SignAvatars, we further propose a unified benchmark of 3D SL holistic motion production. We believe that this work is a significant step towards bringing the digital world to hearing-impaired communities.

SignAvatars Dataset Modality

Sign Language Motion Annotations


Figure 3: Illustration of the automatic motion annotation pipeline. Given an RGB image sequence as input, the pipeline performs hierarchical initialization, followed by optimization with temporal smoothing and biomechanical constraints. It outputs a motion sequence of SMPL-X parameters.
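To make the two regularizers named in the caption concrete, here is a minimal sketch of how a temporal smoothing term and a biomechanical (joint-limit) term might be combined. The page does not give the actual loss terms, so the function names, the squared-difference form, the per-dimension angle limits, and the weights `w_smooth` and `w_limit` are all assumptions made for illustration, not the authors' formulation.

```python
def temporal_smoothness(poses):
    """Sum of squared frame-to-frame differences over a pose sequence.

    `poses` is a list of frames; each frame is a list of pose parameters
    (hypothetical stand-ins for per-frame SMPL-X pose values).
    """
    total = 0.0
    for prev, curr in zip(poses, poses[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total


def joint_limit_penalty(poses, lower, upper):
    """Squared amount by which each parameter leaves its allowed range.

    `lower` and `upper` are assumed per-dimension angle limits.
    """
    total = 0.0
    for frame in poses:
        for value, lo, hi in zip(frame, lower, upper):
            excess = max(lo - value, 0.0) + max(value - hi, 0.0)
            total += excess ** 2
    return total


def regularizer(poses, lower, upper, w_smooth=1.0, w_limit=1.0):
    """Weighted sum of the two illustrative terms (weights are assumptions)."""
    return (w_smooth * temporal_smoothness(poses)
            + w_limit * joint_limit_penalty(poses, lower, upper))


# Toy sequence: three frames, two pose parameters each.
poses = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
lower, upper = [-1.0, -1.0], [1.0, 1.0]
print(regularizer(poses, lower, upper))
```

In an optimization setting, a term of this kind would be added to the data-fitting objective, so that jittery or anatomically implausible poses are penalized while the fit to the image evidence is preserved.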


Application: SignVAE for SLP


Figure 7: Overview of our 3D SLP network. Our method consists of a two-stage process. We first create semantic and motion codebooks using two VQ-VAEs, mapping inputs to their respective code indices. Then, we employ an auto-regressive model to generate motion code indices based on semantic code indices, ensuring a coherent understanding of the data.
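The step of "mapping inputs to their respective code indices" can be illustrated with the standard vector-quantization lookup used in VQ-VAEs: each latent vector is replaced by the index of its nearest codebook entry. The sketch below shows only that generic lookup. The names `quantize` and `encode_sequence`, the toy codebook, and the use of squared Euclidean distance are assumptions for illustration; the encoder, decoder, training losses, and the auto-regressive model of the SLP network are not shown.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector`
    under squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def encode_sequence(latents, codebook):
    """Map a sequence of latent vectors to a sequence of code indices."""
    return [quantize(z, codebook) for z in latents]


# Toy codebook with three 2-D entries and a short latent sequence.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [4.0, 6.0]]
print(encode_sequence(latents, codebook))  # [0, 1, 2]
```

Once both the semantic input and the motion are expressed as index sequences like this, the second stage reduces to predicting one discrete sequence from another, which is the setting an auto-regressive model handles naturally.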


FID analysis