Please Sign Here
FAU's real-time American Sign Language translator relies on the accuracy and speed of YOLOv8 and MediaPipe to break communication barriers.
For the deaf and hard of hearing, sign language opens up a world of communication that would otherwise be inaccessible. The hand movements, facial expressions, and body language used when signing are highly expressive, enabling people to convey complex ideas with a great deal of nuance. However, relatively few people understand sign language, which creates communication barriers for those who rely on it. Furthermore, many different sign languages are used around the world, and they differ from one another every bit as much as spoken languages do.
A translator would go a long way toward solving this problem, as it would take away the substantial burden of learning sign language (or many sign languages!). Wearable gloves and other motion-sensing devices have been experimented with in the past, but these systems tend to be complex and impractical for daily, real-world use. Just recently, however, a small team of engineers at Florida Atlantic University reported on work that could ultimately power a more practical sign language translation device.
The team developed a new computer vision-based approach to recognize the American Sign Language (ASL) alphabet in real time. They began by collecting a dataset of 29,820 images of people making ASL hand gestures. MediaPipe, an open-source framework commonly used for hand landmark tracking, was then leveraged to annotate 21 key points on each hand, completing the dataset.
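To give a sense of what that annotation step looks like, here is a minimal sketch (not the team's actual script) of extracting 21 hand landmarks from a still image with MediaPipe's Hands solution; the file name and single-hand assumption are illustrative.

```python
# Sketch: extract 21 hand key points from one image with MediaPipe Hands.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_hand_keypoints(image_path: str):
    """Return a list of 21 (x, y, z) landmark tuples per detected hand."""
    image = cv2.imread(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB input

    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        results = hands.process(rgb)

    keypoints = []
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            keypoints.append([(lm.x, lm.y, lm.z) for lm in hand.landmark])
    return keypoints

# Example: annotate one training image (placeholder file name)
# print(extract_hand_keypoints("asl_letter_a_0001.jpg"))
```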
For the next phase of the project, YOLOv8, a state-of-the-art object detection model, was selected for its speed and accuracy, which make it a good fit for this real-time application. The YOLOv8 model was then fine-tuned via transfer learning on the newly compiled ASL hand gesture dataset. The key point data generated by MediaPipe proved instrumental in helping YOLOv8 detect subtle differences in hand gestures, but the integration did not stop there. MediaPipe key points were also included in the inference pipeline, alongside YOLOv8's object detections, to provide more accurate and robust results than previous systems.
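For readers curious about the fine-tuning step, the sketch below shows how a pretrained YOLOv8 model can be transfer-learned on a custom dataset with the Ultralytics API. The dataset config file, hyperparameters, and image name are assumptions for illustration, not the team's exact setup, and the MediaPipe key-point fusion is not shown here.

```python
# Sketch: fine-tune a COCO-pretrained YOLOv8 model on an ASL alphabet dataset.
from ultralytics import YOLO

# Start from pretrained weights and transfer-learn on the ASL gesture data
model = YOLO("yolov8n.pt")
model.train(data="asl_alphabet.yaml", epochs=100, imgsz=640)  # placeholder config

# Run detection on a single image; each result carries predicted boxes and classes
results = model("asl_letter_b_0042.jpg")
for r in results:
    print(r.boxes.cls, r.boxes.conf)  # predicted letter classes and confidences
```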
During an evaluation, the system demonstrated exceptional performance across key metrics. The model achieved a precision of 98 percent, indicating that nearly all of its predictions were correct, while a recall rate of 98.5 percent showed that it identified nearly all actual instances. The F1 score, which balances precision and recall, reached an impressive 99 percent, showcasing the system's robustness and reliability.
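For reference, these metrics are computed from raw prediction counts as in the small sketch below; the counts used here are arbitrary placeholders, not the team's evaluation data.

```python
# Sketch: precision, recall, and F1 from true/false positive and false negative counts.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)  # fraction of predictions that were correct
    recall = tp / (tp + fn)     # fraction of actual instances that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Placeholder counts purely for illustration
print(precision_recall_f1(tp=95, fp=3, fn=2))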
The system also excelled in real-time performance, achieving an average detection time of 0.0849 seconds per image when running on the CPU of a typical desktop computer. Live testing using a webcam and MediaPipe for hand tracking demonstrated smooth and accurate gesture recognition, hinting at the model's applicability for real-world use cases.
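A live setup like the one described can be approximated with a simple webcam loop that times each detection, as in the rough sketch below. The model weights file and camera index are assumptions, and the MediaPipe hand-tracking fusion is omitted for brevity.

```python
# Sketch: webcam loop that times per-frame YOLOv8 detection and displays results.
import time
import cv2
from ultralytics import YOLO

model = YOLO("asl_yolov8_finetuned.pt")  # hypothetical fine-tuned weights
cap = cv2.VideoCapture(0)                # default webcam

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    start = time.perf_counter()
    results = model(frame, verbose=False)  # run detection on the current frame
    print(f"detection time: {time.perf_counter() - start:.4f} s")

    annotated = results[0].plot()          # draw predicted boxes and labels
    cv2.imshow("ASL alphabet detection", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```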
Unlike most existing systems, which often trade accuracy for speed, this approach delivered high precision with minimal latency, making it suitable for use in real-time communication aids. To date, the system can translate only the ASL alphabet, but extending that coverage should require little more than collecting a larger and more diverse training dataset.