EvSign: Sign Language Recognition and Translation
with Streaming Events

Pengyu Zhang 1,2      Hao Yin 1      Zeren Wang 1      Wenyue Chen 1
Shengming Li 1      Dong Wang 1       Huchuan Lu 1      Xu Jia 1
1 IIAU, Dalian University of Technology   2 National University of Singapore  


News

  1. 7/12/2024: The EvSign dataset is released on Google Drive and Baidu disk.
  2. 7/1/2024: Our paper introducing the EvSign dataset is accepted by ECCV 2024. The paper is available on arXiv.

Abstract


Sign language is one of the most effective communication tools for people with hearing difficulties. Most existing works focus on improving the performance of sign language tasks on RGB videos, which may suffer from degraded recording conditions, such as motion blur caused by fast hand movement and the signer's textured appearance. The bio-inspired event camera, which asynchronously captures brightness changes at high speed, can naturally perceive dynamic hand movements, providing rich manual cues for sign language tasks. In this work, we aim to explore the potential of event cameras in continuous sign language recognition (CSLR) and sign language translation (SLT). To promote this research, we first collect an event-based benchmark, EvSign, for these tasks with both gloss and spoken language annotations. The EvSign dataset offers a substantial amount of high-quality event streams and an extensive vocabulary of glosses and words, thereby facilitating the development of sign language tasks. In addition, we propose an efficient transformer-based framework for event-based SLR and SLT tasks, which fully leverages the advantages of streaming events. A sparse backbone is employed to extract visual features from sparse events. Then, temporal coherence is effectively exploited through the proposed local token fusion and gloss-aware temporal aggregation modules. Extensive experimental results are reported on both simulated (PHOENIX14T) and EvSign datasets. Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost (0.84 GFLOPs per video) and 44.2% of the network parameters.

Download

Event data

  - Raw events: Baidu disk / Google Drive
  - Voxel grids: Baidu disk / Google Drive
  - Annotations: Baidu disk / Google Drive
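The dataset is distributed both as raw event streams and as a precomputed voxel-grid representation. As a rough illustration of how raw events (x, y, timestamp, polarity) are commonly binned into a voxel grid, here is a minimal NumPy sketch; the bin count, array layout, and bilinear temporal splitting are generic assumptions, not necessarily the exact format used by EvSign.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events into a (num_bins, H, W) voxel grid.

    `events` is an (N, 4) float array with columns (x, y, t, p),
    where polarity p is +1/-1. Timestamps are normalized so the
    events are spread over the temporal bins. This is a common
    representation; EvSign's official format may differ.
    """
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return voxel
    t = events[:, 2]
    # Normalize timestamps to [0, num_bins - 1].
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    p = events[:, 3]
    left = np.floor(t_norm).astype(np.int64)
    frac = t_norm - left
    # Split each event's polarity bilinearly between adjacent time bins.
    np.add.at(voxel, (left, y, x), p * (1.0 - frac))
    right = np.clip(left + 1, 0, num_bins - 1)
    np.add.at(voxel, (right, y, x), p * frac)
    return voxel
```

`np.add.at` is used instead of plain indexing so that multiple events falling on the same pixel and bin accumulate rather than overwrite each other.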

RGB data

RGB data is coming soon.

Method & Results

  1. Efficient feature extraction: We introduce a sparse version of ResNet18 to extract visual features from sparse events.
  2. Hierarchical temporal modeling for streaming events: First, we build an effective representation of continuous actions within a short duration and reduce the computational load via Local Token Fusion (LTF). Then, we propose Gloss-Aware Temporal Aggregation (GATA) to exploit the correspondence of continuous signs in long-term videos.
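The core idea behind LTF, merging neighboring temporal tokens into one to shorten the sequence before long-term modeling, can be illustrated with a toy NumPy sketch. The real LTF is a learned module inside the transformer; the fixed window size and simple average pooling below are assumptions made purely for illustration.

```python
import numpy as np

def local_token_fusion(tokens, window=4):
    """Toy stand-in for Local Token Fusion (LTF).

    `tokens` is a (T, C) sequence of per-frame features. Each
    non-overlapping window of `window` tokens is averaged into a
    single token, shortening the sequence (and the cost of all
    downstream attention) by roughly `window`x while keeping a
    summary of short-duration motion. The actual LTF module is
    learned; this pooling is only a conceptual sketch.
    """
    T, C = tokens.shape
    pad = (-T) % window  # pad by repeating the last token so T divides evenly
    if pad:
        tokens = np.concatenate([tokens, np.repeat(tokens[-1:], pad, axis=0)])
    return tokens.reshape(-1, window, C).mean(axis=1)
```

The fused, shorter sequence would then feed a gloss-aware aggregation stage (GATA in the paper) that relates signs across the whole video.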

Citation

@InProceedings{Zhang_ECCV24_EvSign,
  author    = {Pengyu Zhang and Hao Yin and Zeren Wang and Wenyue Chen
               and Shengming Li and Dong Wang and Huchuan Lu and Xu Jia},
  title     = {EvSign: Sign Language Recognition and Translation with Streaming Events},
  booktitle = {European Conference on Computer Vision},
  year      = {2024}}

Contact

If you have any questions, please contact Pengyu Zhang at zpy.iiau@gmail.com.