This project provides a deep-dive exploration of the Kaggle datasets that Google published for its two 2023 competitions related to ASL (American Sign Language) recognition:
- Google - Isolated Sign Language Recognition
- Google - American Sign Language Fingerspelling Recognition
Since an image is worth a thousand words, a video is certainly worth even more. To that end, I created viewers for these two datasets, as well as a similar viewer for a live USB camera feed. I chose to display the landmarks generated by the MediaPipe framework (holistic model, including pose, face, hands), as well as a time lapse of the most important landmarks.
In the video, I have captured myself performing two styles of sign language:
- sign language (i.e. words and phrases): THANK YOU
- fingerspelling: A, V, N, E, T
The Kaggle datasets provide captured examples for each style. The latter (fingerspelling) is used for content that is not covered by sign language vocabulary itself, such as phone numbers, web URLs, etc.
The Competitions
In 2023, Google launched two open-source competitions on Kaggle, totaling $300,000 in prizes.
Google - Isolated Sign Language Recognition
- https://www.kaggle.com/competitions/asl-signs/overview
- Feb-May 2023
- $100,000 Prize
Google - American Sign Language Fingerspelling Recognition
- https://www.kaggle.com/competitions/asl-fingerspelling
- May-Aug 2023
- $200,000 Prize
It is very interesting to analyze the winning solution for the Fingerspelling Recognition competition:
The input to the solution is the following subset of 130 landmarks:
- 21 key points from each hand
- 6 pose key points from each arm
- 76 from the face (lips, nose, eyes)
In other words, the hand, face, and pose landmarks output by the MediaPipe framework (holistic model) are used as input to the very complex problem of sign language recognition.
A time-lapse of the landmarks is used to generate tokens, which are then used as input for natural language processing.
The Datasets
The datasets for these competitions are now public on the Kaggle platform.
Kaggle datasets can be directly downloaded from the website with a valid user account. Kaggle also provides a programming API and access keys that can be used to download datasets programmatically, as follows:
kaggle competitions download -c asl-signs
kaggle competitions download -c asl-fingerspelling
It is important to know the size of the datasets before you attempt to download them; make sure you have enough room for the archives and the extracted data.
If you do not have enough room for the datasets, you can still view the landmarks from a USB camera.
asl-signs
- archive size : 40GB
- extracted size : 57GB
- link : https://www.kaggle.com/competitions/asl-signs/data
asl-fingerspelling
- archive size : 170GB
- extracted size : 190GB
- link : https://www.kaggle.com/competitions/asl-fingerspelling/data
Landmarks
The two datasets do not contain any images or video, but rather landmarks captured with the MediaPipe models. There is a total of 543 landmarks for each sample:
- face : 468 landmarks
- left hand : 21 landmarks
- right hand : 21 landmarks
- pose : 33 landmarks
When landmarks are absent/missing, they are represented as NaN in the datasets.
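In code, a simple way to deal with this is to check whether an entire block of landmarks is NaN before processing or drawing it (a minimal sketch, not taken from the viewers themselves):
import numpy as np

def hand_is_missing(hand_xyz):
    # hand_xyz: array of shape (21, 3) holding x, y, z for one hand in one frame;
    # missing landmarks are represented as NaN in the datasets
    return np.isnan(hand_xyz).all()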
We can approximate the number of samples in the datasets as follows:
- total size = 190GB + 57GB = 247GB
- size of one sample = 543 landmarks * 8B/landmark = 4344B/sample
- samples = (247 * 1024 * 1024 * 1024) / 4344 ≈ 61 million samples
If we played this content back at 10fps, this would take 6.1 million seconds, or about 70 days!
- 6.1 million seconds / (60 sec/min × 60 min/hr × 24 hr/day) ≈ 70 days
This is clearly a lot of data!
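The same back-of-the-envelope estimate can be reproduced in a few lines of Python (the 8 bytes per landmark is a rough assumption, not a documented figure):
# Rough estimate of the number of samples and the playback time at 10 fps
total_bytes      = 247 * 1024**3            # 247 GB of combined extracted data
bytes_per_sample = 543 * 8                  # 543 landmarks at ~8 bytes each
samples          = total_bytes / bytes_per_sample
seconds_at_10fps = samples / 10
days             = seconds_at_10fps / (60 * 60 * 24)
print(f"{samples/1e6:.0f} million samples, {days:.1f} days of playback at 10 fps")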
One important thing to know is that up to 10% of the datasets contain garbage (useless) data, according to feedback from participants; often, the hand landmarks are missing from the data samples.
After reading the various notebooks from the competition participants, I discovered that the datasets were captured with an application using a selfie camera.
One possible explanation is that the participants were often holding their mobile phone with one hand, so their recordings capture only the other hand. This remains speculation, especially since some signs require the use of two hands.
Since the landmarks are normalized, there is no way of knowing the original image/video resolution. The aspect ratio, although not documented, is important in my opinion, so I tried to determine this for each of the datasets.
After viewing the landmarks from each dataset, I estimated the following aspect ratios:
- asl-signs : 3:4
- asl-fingerspelling : 9:16
The frame rate of the datasets seems to be approximately 10 frames per second. I have implemented the following frame rates in my viewers:
- asl-signs : 20fps (i.e. 50ms)
- asl-fingerspelling : 10fps (i.e. 100ms)
Feel free to change any of these values (aspect ratio, frame rate, etc...) in the code.
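For reference, here is roughly how an assumed aspect ratio and frame rate translate into a display canvas and playback delay (the width, height and fps values below are the assumptions above, not anything dictated by the datasets):
# Assumed canvas for a 3:4 aspect ratio (asl-signs); adjust to taste
WIDTH, HEIGHT = 480, 640
FPS = 20                          # 20 fps => 50 ms per frame

def to_pixels(x_norm, y_norm):
    # MediaPipe landmarks are normalized to [0, 1]; scale to the chosen canvas
    return int(x_norm * WIDTH), int(y_norm * HEIGHT)

delay_ms = int(1000 / FPS)        # e.g. passed to cv2.waitKey() between frames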
Time Lapse of Landmarks
All of the viewers have a time-lapse of the top 130 landmarks.
The choice of the 130 landmarks can be traced back to the winner of the first competition, Hoyeol Sohn (https://www.kaggle.com/hoyso48), and corresponds to the following landmarks:
- face (lips, nose, left eye, right eye) : 76 landmarks
- left hand : 21 landmarks
- right hand : 21 landmarks
- pose (left arm, right arm) : 12 landmarks
The winners of the second competition, Darragh Hanley (https://www.kaggle.com/darraghdog) and Christof Henkel (https://www.kaggle.com/christofhenkel), reused the same choice of 130 landmarks.
I think we can assume that this is a good and relevant selection for analysis and further processing.
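As an illustration, the subset selection amounts to indexing 130 of the 543 landmarks in each frame (a sketch only; the face and arm index lists below are placeholders, and the face / left_hand / pose / right_hand concatenation order is an assumption, not the winners' exact code):
import numpy as np

# Illustrative only: select a 130-landmark subset from the 543 landmarks of
# one frame. FACE_SUBSET and POSE_ARMS are placeholder index lists.
FACE_SUBSET = list(range(0, 76))                  # 76 lips/nose/eye indices (placeholder)
LEFT_HAND   = list(range(468, 468 + 21))          # 21 left-hand landmarks
POSE_ARMS   = list(range(489 + 11, 489 + 23))     # 12 arm key points (placeholder)
RIGHT_HAND  = list(range(522, 522 + 21))          # 21 right-hand landmarks
SELECTED    = FACE_SUBSET + LEFT_HAND + POSE_ARMS + RIGHT_HAND   # 130 indices

def select_landmarks(frame_xyz):
    # frame_xyz: NumPy array of shape (543, 3) holding x, y, z for one frame
    return frame_xyz[SELECTED]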
Installing the Viewers
The dataset viewers can be accessed from the following GitHub repository:
git clone https://github.com/AlbertaBeef/aslr_exploration
cd aslr_exploration
If you have not already done so, download the datasets from the Kaggle website to the "aslr_exploration" directory, or use the Kaggle API as follows:
kaggle competitions download -c asl-signs
kaggle competitions download -c asl-fingerspelling
Extract the "asl-signs" dataset as follows:
mkdir asl-signs
cd asl-signs
unzip ../asl-signs.zip
cd ..
Extract the "asl-fingerspelling" dataset as follows:
mkdir asl-fingerspelling
cd asl-fingerspelling
unzip ../asl-fingerspelling.zip
cd ..
You are all set!
Viewing the "asl-signs" Dataset
The "asl-signs" dataset contains sequences from various participants for 250 words.
The "asl-signs" dataset is provided in the following format:
train (94,481 total sequences)
- train.csv
- train_landmark_files/[participant_id]/[sequence_id].parquet
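For example, the index file can be inspected with pandas (a minimal sketch, assuming train.csv provides at least the path and sign columns):
import pandas as pd

# Peek at the asl-signs index file: one row per recorded sequence
train = pd.read_csv("asl-signs/train.csv")
print(train.columns.tolist())                     # expected to include 'path' and 'sign'
print(len(train), "sequences,", train["sign"].nunique(), "distinct signs")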
The parquet files have the following format:
- frame : int16 (frame number in sequence)
- row_id : string (unique identifier, descriptor)
- type : string (face, pose, left_hand, right_hand)
- landmark_idx : int16
- x/y/z : double
Each frame corresponds to 543 rows (one per landmark, or 1,629 coordinate values in total) sharing a common frame value in the parquet file.
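Putting this together, one way to rebuild per-frame landmark arrays from a parquet file looks roughly like this (a sketch, assuming the path column of train.csv points at the parquet file):
import pandas as pd

# Load one asl-signs sequence and regroup its rows into per-frame arrays
train = pd.read_csv("asl-signs/train.csv")
path  = "asl-signs/" + train.iloc[0]["path"]      # first sequence in the index
df    = pd.read_parquet(path)

# One row per landmark; rows sharing a frame value belong to the same frame
frames = [g[["x", "y", "z"]].to_numpy() for _, g in df.groupby("frame")]
print(len(frames), "frames, each of shape", frames[0].shape)   # (543, 3) per frame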
The "asl-signs" viewer can be launched as follows:
python3 asl_signs_viewer.py
The "asl-fingerspelling" dataset contains the sequences for various sentences performed by various participants. These sentences are typically explicitly spelled out (fingerspelling) since they contain numbers, addresses, web URLs, etc...
The "asl-fingerspelling" dataset is provided in the following format:
train (67,213 total sequences)
- train.csv
- train_landmarks/[participant_id]/[sequence_id].parquet
supplemental_metadata (52,958 total sequences)
- supplemental_metadata.csv
- supplemental_landmarks/[participant_id]/[sequence_id].parquet
By default, the viewer will use the train.csv file. Feel free to edit the python script to use the supplemental_metadata.csv instead.
The parquet files have the following format:
- sequence_id : int16 (unique identifier of sequence)
- frame : string (frame number in sequence)
- type : string (face, pose, left_hand, right_hand)
- [x/y/z]_[type]_[landmark_idx] : double (1,629 spatial coordinate columns for the x, y and z coordinates of each of the 543 landmarks, where type is one of face, pose, left_hand, right_hand)
Each frame corresponds to one row in the parquet file.
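As an illustration, one sequence can be pulled out of the wide-format parquet as follows (a sketch, assuming train.csv provides path, sequence_id and phrase columns):
import pandas as pd

# Inspect one fingerspelling sequence from the wide-format parquet
train = pd.read_csv("asl-fingerspelling/train.csv")
row   = train.iloc[0]
df    = pd.read_parquet("asl-fingerspelling/" + row["path"])

# Keep only the rows belonging to this sequence (a file may hold several)
if "sequence_id" in df.columns:
    df = df[df["sequence_id"] == row["sequence_id"]]

# Wide format: one row per frame, one column per coordinate per landmark
x_cols = [c for c in df.columns if c.startswith("x_")]
print(repr(row["phrase"]), "-", len(df), "frames,", len(x_cols), "x-coordinate columns")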
The "asl-fingerspelling" viewer can be launched as follows:
python3 asl_fingerspelling_viewer.py
Viewing a Live USB Camera Feed
It may be interesting to view custom data, or capture additional data for your custom applications. The MediaPipe holistic viewer can be used for this purpose.
The "MediaPipe holistic" viewer can be launched as follows:
python3 mediapipe_holistic_viewer.py
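For reference, the core of such a viewer is just the MediaPipe holistic model running on OpenCV camera frames (a minimal sketch, without the landmark drawing and time lapse of the bundled viewer):
import cv2
import mediapipe as mp

# Run the MediaPipe holistic model on a live USB camera feed
cap = cv2.VideoCapture(0)
with mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                    min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB images; OpenCV delivers BGR
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.right_hand_landmarks:
            print("right hand:", len(results.right_hand_landmarks.landmark), "landmarks")
        cv2.imshow("holistic", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()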
I hope that these viewers will inspire you to implement your own custom application.
What applications can you imagine that could leverage these datasets?
Do you have an interest in deploying the MediaPipe models to embedded platforms?
If yes, which one?
Let me know in the comments...
Acknowledgements
I want to thank Google (https://www.hackster.io/google) for making the following available publicly:
- MediaPipe
- asl-signs : ASL Isolated Signs Dataset
- asl-fingerspelling : ASL Fingerspelling Dataset
I also want to thank Kaggle for hosting the competitions and datasets, as well as the following winning Kaggle members for their insight into these datasets:
- Hoyeol Sohn (https://www.kaggle.com/hoyso48)
- Darragh Hanley (https://www.kaggle.com/darraghdog)
- Christof Henkel (https://www.kaggle.com/christofhenkel)
- 2024/08/12 - Initial Version
- [Kaggle] Isolated Sign Language Recognition : https://www.kaggle.com/competitions/asl-signs/overview
- [Kaggle] Fingerspelling Recognition : https://www.kaggle.com/competitions/asl-fingerspelling
- [AlbertaBeef] aslr_exploration : https://github.com/AlbertaBeef/aslr_exploration