This project provides a deep-dive exploration of the Kaggle datasets that Google published for its two 2023 competitions related to ASL (American Sign Language) recognition:
- Google - Isolated Sign Language Recognition
- Google - American Sign Language Fingerspelling Recognition
Since an image is worth a thousand words, a video is certainly worth even more. To that end, I created viewers for these two datasets, as well as a similar viewer for a live USB camera feed. I chose to display the landmarks generated by the MediaPipe framework (holistic model, including pose, face, hands), as well as a time lapse of the most important landmarks.
In the video, I have captured myself performing two styles of sign language:
- sign language (i.e. words and phrases): THANK YOU
- fingerspelling: A, V, N, E, T
The Kaggle datasets provide captured examples for each style. The latter (fingerspelling) is used for content that is not covered by sign language vocabulary itself, such as phone numbers, web URLs, etc.
The Competitions
In 2023, Google launched two open-source competitions on Kaggle, totaling $300,000 in prizes.
Google - Isolated Sign Language Recognition
- https://www.kaggle.com/competitions/asl-signs/overview
- Feb-May 2023
- $100,000 Prize
Google - American Sign Language Fingerspelling Recognition
- https://www.kaggle.com/competitions/asl-fingerspelling
- May-Aug 2023
- $200,000 Prize
It is very interesting to analyze the winning solution for the Fingerspelling Recognition competition:
The input to the solution is the following subset of 130 landmarks:
- 21 key points from each hand
- 6 pose key points from each arm
- 76 from the face (lips, nose, eyes)
In other words, the hand, face, and pose landmarks output by the MediaPipe framework (holistic model) are used as input to the very complex problem of sign language recognition.
A time-lapse of the landmarks is used to generate tokens, which are then used as input for natural language processing.
The Datasets
The datasets for these competitions are now public on the Kaggle platform.
Kaggle datasets can be directly downloaded from the website with a valid user account. Kaggle also provides a programming API and access keys that can be used to download datasets programmatically, as follows:
kaggle competitions download -c asl-signs
kaggle competitions download -c asl-fingerspelling
It is important to know the size of the datasets before you attempt to download them; make sure you have enough room for the archives and the extracted data.
If you do not have enough room for the datasets, you can still view the landmarks from a USB camera.
asl-signs
- archive size : 40GB
- extracted size : 57GB
- link : https://www.kaggle.com/competitions/asl-signs/data
asl-fingerspelling
- archive size : 170GB
- extracted size : 190GB
- link : https://www.kaggle.com/competitions/asl-fingerspelling/data
Landmarks
The two datasets do not contain any images or video, but rather landmarks captured with the MediaPipe models. There is a total of 543 landmarks for each sample:
- face : 468 landmarks
- left hand : 21 landmarks
- right hand : 21 landmarks
- pose : 33 landmarks
When landmarks are absent/missing, they are represented as NaN in the datasets.
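In code, a simple way to deal with this is to check whether an entire block of landmarks is NaN before processing or drawing it (a minimal sketch, not taken from the viewers themselves):
import numpy as np

def hand_is_missing(hand_xyz):
    # hand_xyz: array of shape (21, 3) holding x, y, z for one hand in one frame;
    # missing landmarks are represented as NaN in the datasets
    return np.isnan(hand_xyz).all()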
We can approximate the number of samples in the datasets as follows:
- total size = 190GB + 57GB = 247GB
- size of one sample = 543 landmarks * 8B/landmark = 4344B/sample
- samples = (247 * 1024 * 1024 * 1024) / 4344 ≈ 61 million samples
If we played this content back at 10fps, this would take 6.1 million seconds, or about 70 days!
- 6.1 million seconds / (60 sec/min × 60 min/hr × 24 hr/day) ≈ 70 days
This is clearly a lot of data!
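The same back-of-the-envelope estimate can be reproduced in a few lines of Python (the 8 bytes per landmark is a rough assumption, not a documented figure):
# Rough estimate of the number of samples and the playback time at 10 fps
total_bytes      = 247 * 1024**3            # 247 GB of combined extracted data
bytes_per_sample = 543 * 8                  # 543 landmarks at ~8 bytes each
samples          = total_bytes / bytes_per_sample
seconds_at_10fps = samples / 10
days             = seconds_at_10fps / (60 * 60 * 24)
print(f"{samples/1e6:.0f} million samples, {days:.1f} days of playback at 10 fps")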
One important thing to know is that up to 10% of the datasets contain garbage (useless) data, according to feedback from participants; often, the hand landmarks are missing from the data samples.
After reading the various notebooks from the competition participants, I discovered that the datasets were captured with an application using a selfie camera.
One possible explanation is that the participants were often holding their mobile phone with one hand, so their recordings capture only the other hand. This remains speculation, especially since some signs require the use of two hands.
Since the landmarks are normalized, there is no way of knowing the original image/video resolution. The aspect ratio, although not documented, is important in my opinion, so I tried to determine this for each of the datasets.
After viewing the landmarks from each dataset, I estimated the following aspect ratios:
- asl-signs : 3:4
- asl-fingerspelling : 9:16
The frame rate of the datasets seems to be approximately 10 frames per second. I have implemented the following frame rates in my viewers:
- asl-signs : 20fps (i.e. 50ms)
- asl-fingerspelling : 10fps (i.e. 100ms)
Feel free to change any of these values (aspect ratio, frame rate, etc...) in the code.
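For reference, here is roughly how an assumed aspect ratio and frame rate translate into a display canvas and playback delay (the width, height and fps values below are the assumptions above, not anything dictated by the datasets):
# Assumed canvas for a 3:4 aspect ratio (asl-signs); adjust to taste
WIDTH, HEIGHT = 480, 640
FPS = 20                          # 20 fps => 50 ms per frame

def to_pixels(x_norm, y_norm):
    # MediaPipe landmarks are normalized to [0, 1]; scale to the chosen canvas
    return int(x_norm * WIDTH), int(y_norm * HEIGHT)

delay_ms = int(1000 / FPS)        # e.g. passed to cv2.waitKey() between frames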
Time Lapse of Landmarks
All of the viewers have a time-lapse of the top 130 landmarks.
The choice of the 130 landmarks can be traced back to the winner of the first competition, Hoyeol Sohn (https://www.kaggle.com/hoyso48), and corresponds to the following landmarks:
- face (lips, nose, left eye, right eye) : 76 landmarks
- left hand : 21 landmarks
- right hand : 21 landmarks
- pose (left arm, right arm) : 12 landmarks
The winners of the second competition, Darragh Hanley (https://www.kaggle.com/darraghdog) and Christof Henkel (https://www.kaggle.com/christofhenkel), reused the same choice of 130 landmarks.
I think we can assume that this is a good and relevant selection for analysis and further processing.
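As an illustration, the subset selection amounts to indexing 130 of the 543 landmarks in each frame (a sketch only; the face and arm index lists below are placeholders, and the face / left_hand / pose / right_hand concatenation order is an assumption, not the winners' exact code):
import numpy as np

# Illustrative only: select a 130-landmark subset from the 543 landmarks of
# one frame. FACE_SUBSET and POSE_ARMS are placeholder index lists.
FACE_SUBSET = list(range(0, 76))                  # 76 lips/nose/eye indices (placeholder)
LEFT_HAND   = list(range(468, 468 + 21))          # 21 left-hand landmarks
POSE_ARMS   = list(range(489 + 11, 489 + 23))     # 12 arm key points (placeholder)
RIGHT_HAND  = list(range(522, 522 + 21))          # 21 right-hand landmarks
SELECTED    = FACE_SUBSET + LEFT_HAND + POSE_ARMS + RIGHT_HAND   # 130 indices

def select_landmarks(frame_xyz):
    # frame_xyz: NumPy array of shape (543, 3) holding x, y, z for one frame
    return frame_xyz[SELECTED]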
Installing the Viewers
The dataset viewers can be accessed from the following GitHub repository:
git clone https://github.com/AlbertaBeef/aslr_exploration
cd aslr_exploration
If you have not already done so, download the datasets from the Kaggle website to the "aslr_exploration" directory, or use the Kaggle API as follows:
kaggle competitions download -c asl-signs
kaggle competitions download -c asl-fingerspelling
Extract the "asl-signs" dataset as follows:
mkdir asl-signs
cd asl-signs
unzip ../asl-signs.zip
cd ..
Extract the "asl-fingerspelling" dataset as follows:
mkdir asl-fingerspelling
cd asl-fingerspelling
unzip ../asl-fingerspelling.zip
cd ..
You are all set!
Viewing the "asl-signs" Dataset
The "asl-signs" dataset contains sequences from various participants for 250 words.
The "asl-signs" dataset is provided in the following format:
train (94,481 total sequences)
- train.csv
- train_landmark_files/[participant_id]/[sequence_id].parquet
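For example, the index file can be inspected with pandas (a minimal sketch, assuming train.csv provides at least the path and sign columns):
import pandas as pd

# Peek at the asl-signs index file: one row per recorded sequence
train = pd.read_csv("asl-signs/train.csv")
print(train.columns.tolist())                     # expected to include 'path' and 'sign'
print(len(train), "sequences,", train["sign"].nunique(), "distinct signs")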
The parquet files have the following format:
- frame : int16 (frame number in sequence)
- row_id : string (unique identifier, descriptor)
- type : string (face, pose, left_hand, right_hand)
- landmark_idx : int16
- x/y/z : double
Each frame corresponds to 543 rows (one per landmark, or 1,629 coordinate values in total) sharing a common frame value in the parquet file.
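Putting this together, one way to rebuild per-frame landmark arrays from a parquet file looks roughly like this (a sketch, assuming the path column of train.csv points at the parquet file):
import pandas as pd

# Load one asl-signs sequence and regroup its rows into per-frame arrays
train = pd.read_csv("asl-signs/train.csv")
path  = "asl-signs/" + train.iloc[0]["path"]      # first sequence in the index
df    = pd.read_parquet(path)

# One row per landmark; rows sharing a frame value belong to the same frame
frames = [g[["x", "y", "z"]].to_numpy() for _, g in df.groupby("frame")]
print(len(frames), "frames, each of shape", frames[0].shape)   # (543, 3) per frame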
The "asl-signs" viewer can be launched as follows:
python3 asl_signs_viewer.py
The "asl-fingerspelling" dataset contains the sequences for various sentences performed by various participants. These sentences are typically explicitly spelled out (fingerspelling) since they contain numbers, addresses, web URLs, etc...
The "asl-fingerspelling" dataset is provided in the following format:
train (67,213 total sequences)
- train.csv
- train_landmarks/[participant_id]/[sequence_id].parquet
supplemental_metadata (52,958 total sequences)
- supplemental_metadata.csv
- supplemental_landmarks/[participant_id]/[sequence_id].parquet
By default, the viewer will use the train.csv file. Feel free to edit the python script to use the supplemental_metadata.csv instead.
The parquet files have the following format:
- sequence_id : int16 (unique identifier of sequence)
- frame : string (frame number in sequence)
- type : string (face, pose, left_hand, right_hand)
- [x/y/z]_[type]_[landmark_idx] : double (1,629 spatial coordinate columns for the x, y and z coordinates of each of the 543 landmarks, where type is one of face, pose, left_hand, right_hand)
Each frame corresponds to one row in the parquet file.
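As an illustration, one sequence can be pulled out of the wide-format parquet as follows (a sketch, assuming train.csv provides path, sequence_id and phrase columns):
import pandas as pd

# Inspect one fingerspelling sequence from the wide-format parquet
train = pd.read_csv("asl-fingerspelling/train.csv")
row   = train.iloc[0]
df    = pd.read_parquet("asl-fingerspelling/" + row["path"])

# Keep only the rows belonging to this sequence (a file may hold several)
if "sequence_id" in df.columns:
    df = df[df["sequence_id"] == row["sequence_id"]]

# Wide format: one row per frame, one column per coordinate per landmark
x_cols = [c for c in df.columns if c.startswith("x_")]
print(repr(row["phrase"]), "-", len(df), "frames,", len(x_cols), "x-coordinate columns")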
The "asl-fingerspelling" viewer can be launched as follows:
python3 asl_fingerspelling_viewer.py
Viewing a Live USB Camera Feed
It may be interesting to view custom data, or capture additional data for your custom applications. The MediaPipe holistic viewer can be used for this purpose.
The "MediaPipe holistic" viewer can be launched as follows:
python3 mediapipe_holistic_viewer.py
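For reference, the core of such a viewer is just the MediaPipe holistic model running on OpenCV camera frames (a minimal sketch, without the landmark drawing and time lapse of the bundled viewer):
import cv2
import mediapipe as mp

# Run the MediaPipe holistic model on a live USB camera feed
cap = cv2.VideoCapture(0)
with mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                    min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB images; OpenCV delivers BGR
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.right_hand_landmarks:
            print("right hand:", len(results.right_hand_landmarks.landmark), "landmarks")
        cv2.imshow("holistic", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()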
I hope that these viewers will inspire you to implement your own custom application.
What applications can you imagine that could leverage these datasets?
Do you have an interest in deploying the MediaPipe models to embedded platforms?
If yes, which one?
Let me know in the comments...
Acknowledgements
I want to thank Google (https://www.hackster.io/google) for making the following available publicly:
- MediaPipe
- asl-signs : ASL Isolated Signs Dataset
- asl-fingerspelling : ASL Fingerspelling Dataset
I also want to thank Kaggle for hosting the competitions and datasets, as well as the following winning Kaggle members for their insight into these datasets:
- Hoyeol Sohn (https://www.kaggle.com/hoyso48)
- Darragh Hanley (https://www.kaggle.com/darraghdog)
- Christof Henkel (https://www.kaggle.com/christofhenkel)
- 2024/08/12 - Initial Version
- [Kaggle] Isolated Sign Language Recognition : https://www.kaggle.com/competitions/asl-signs/overview
- [Kaggle] Fingerspelling Recognition : https://www.kaggle.com/competitions/asl-fingerspelling
- [AlbertaBeef] aslr_exploration : https://github.com/AlbertaBeef/aslr_exploration