Communication between deaf-mute individuals and others remains largely restricted despite increasingly digitized means of communication. This is mainly due to the significant difference between the languages used by deaf-mute and non-deaf-mute individuals, which differ both in how they are structured and in how information is articulated.
Recognizing and generating spoken language is a widely studied field. Enabling communication between deaf-mute and non-deaf-mute people, however, requires, firstly, recognizing articulated sign language and translating it into standard spoken language and, secondly, producing sign language from spoken language.
Typically, such devices are useful if they operate in real time, allowing a seamless conversation between two individuals. In virtual communication, power requirements are less relevant; for in-person communication, however, a mobile device that operates power-efficiently would be helpful. In contrast to server-based computation, on-device execution of such algorithms is an option for people who are not willing to share their private communication online.
This work investigates options and challenges arising when designing near-real-time sign language recognition on personal computers, aiming to help build communicative bridges between deaf-mute and non-deaf-mute people in the future.
We port a system that recognizes sign language symbols (glosses) with a pre-trained model to the AMD hardware. Additionally, we connect it to a language transformer to generate natural language.
Method

Our architecture consists of two blocks: 1) The sign recognition model extracts visual features for classifying glosses, which correspond to a set of words. 2) These glosses are fed into a large language model that generates a complete sentence.
For the sign recognition model, we mainly follow Y. Chen (2022): Visual features are extracted using the visual encoder S3D (S. Xie, 2018). The gloss classifier's frame-wise predictions are then decoded into the predicted gloss sequence using our implementation of connectionist temporal classification (CTC) (A. Graves, 2006).
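The decoding step can be illustrated with a minimal greedy (best-path) CTC decoder; the function name, blank index, and the choice of greedy rather than beam-search decoding are assumptions for illustration, not the exact implementation:

```python
# Minimal sketch of greedy (best-path) CTC decoding; the function name, blank
# index, and choice of greedy decoding are illustrative assumptions.
import numpy as np

BLANK_ID = 0  # assumed index of the CTC blank token


def decode_ctc_greedy(logits: np.ndarray) -> list[int]:
    """Collapse per-frame argmax predictions into a gloss ID sequence.

    logits: array of shape (T, num_glosses) with per-frame class scores.
    """
    best_path = logits.argmax(axis=-1)  # most likely class per frame
    decoded, previous = [], None
    for token in best_path:
        if token != previous and token != BLANK_ID:
            decoded.append(int(token))  # keep one gloss per run, skip blanks
        previous = token
    return decoded
```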
Instead of using a translation language model with a vision-language mapper as in Y. Chen (2022), we use the language transformer Llama 2 7B-chat (H. Touvron, 2023) because it is supported on the NPU hardware. Additionally, we are interested in applying a language model that is generally applicable and not restricted to the language of the given dataset.
For the deployment of our processing chain, we develop the architecture depicted in Figure 1, in which inference runs as described below. Each module is executed in an independent process and communicates via TCP sockets; for a simplified implementation, we leverage the Publisher-Subscriber queues from ZeroMQ.
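A minimal sketch of this wiring with pyzmq is shown below; the port number, topic name, and example payload are illustrative and not taken from our code:

```python
# Minimal sketch of the ZeroMQ Publisher-Subscriber wiring over TCP; in the
# real system, publisher and subscriber live in separate processes. Port,
# topic name, and payload are illustrative.
import time
import zmq

ctx = zmq.Context.instance()

# Publisher side (e.g. the recognition process emitting glosses).
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")

# Subscriber side (e.g. the Llama 2 translation process).
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"glosses")

time.sleep(0.2)  # give the SUB socket time to connect (slow-joiner effect)
pub.send_multipart([b"glosses", b"RAIN CLOUD"])
topic, payload = sub.recv_multipart()
print(topic, payload)
```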
Frames recorded by the webcam are read in via OpenCV and center cropped to reduce the computational load for the subsequent recognition model. Because the video action recognition model expects a tensor of stacked images, we accumulate video frames before sending them off to the queue; depending on the selected model version, each input tensor contains 40, 60, 80, or 120 images. Based on the transmitted input tensor of shape (1, T, C, H, W), our recognition model generates a list of glosses. These glosses are then forwarded to the Llama 2 translation model, which generates a full sentence. To give an impression of the real latencies, the recorded video stream is overlaid with both the glosses and the generated sentences as they become available.
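The capture-and-stack step together with the CPU-side ONNX inference could look roughly as follows; the crop size, frame count, normalization, and model file name are assumptions for illustration:

```python
# Minimal sketch of the capture-and-stack step and the CPU-side inference.
# Crop size, frame count, normalization, and the ONNX file name are
# assumptions for illustration.
import cv2
import numpy as np
import onnxruntime as ort

T, CROP = 80, 224  # frames per input tensor and center-crop size (assumed)

session = ort.InferenceSession("sign_recognition.onnx",  # hypothetical model file
                               providers=["CPUExecutionProvider"])

def center_crop(frame: np.ndarray, size: int) -> np.ndarray:
    h, w, _ = frame.shape
    y, x = (h - size) // 2, (w - size) // 2
    return frame[y:y + size, x:x + size]

cap = cv2.VideoCapture(0)  # webcam
frames = []
while len(frames) < T:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(center_crop(frame, CROP).astype(np.float32) / 255.0)
cap.release()

# Stack to the (1, T, C, H, W) layout expected by the recognition model.
clip = np.stack(frames).transpose(0, 3, 1, 2)[None]
logits = session.run(None, {session.get_inputs()[0].name: clip})[0]
```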
Evaluation

Our implementation is realized on the Minisforum UM790 Pro kindly provided by AMD. The recognition model is CNN-based and executed solely on the CPU via the Ryzen AI ONNX Runtime. The Llama 2 model is quantized and deployed on the NPU. Both neural networks are executed concurrently on the same device. As we aim at showcasing the capabilities of our sign language recognition model on the AMD hardware qualitatively, we do not perform an extensive evaluation of the language generation part. We present two demonstrations of the system:
First, we use videos from the PHOENIX-Weather-2014T dataset (N. Camgoz, 2018) that our sign language recognition model was trained on. We showcase this in Video 1.
To process the video within a time budget that allows near-real-time communication without deteriorating the gloss classification performance, we process the input video chunk-wise and overlap consecutive chunks. In this way, we avoid losing glosses at the boundary between two inference batches of the recognition model. In this case study, we overlap subsequent batches of 80 frames by 15 frames.
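A small sketch of this overlapped chunking (with the hypothetical helper name iter_chunks) is:

```python
# Sketch of the overlapped chunking: consecutive 80-frame windows share their
# last/first 15 frames so that glosses signed at a chunk boundary are not lost.
# The helper name iter_chunks is hypothetical.
CHUNK, OVERLAP = 80, 15
STRIDE = CHUNK - OVERLAP  # 65 new frames per inference step


def iter_chunks(frames):
    """Yield overlapping frame windows of length CHUNK from a frame iterable."""
    buffer = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == CHUNK:
            yield list(buffer)
            buffer = buffer[STRIDE:]  # carry the last OVERLAP frames over
```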
The predicted glosses for each chunk are displayed in the video.
We explored various strategies to reduce the computation time of the Llama model. We use the official 4-bit quantization supplied by AMD in the RyzenAI-SW repository. We further improve the output of the model by pre-conditioning it on the task of reporting a weather forecast.
The maximum number of output tokens is a major factor in the latency of the language transformer. We observe that 30 output tokens result in reasonable sentences generated by the model in our scenario, with a latency of 9 s.
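The prompting setup can be sketched as follows; this stand-in uses the standard Hugging Face transformers API instead of the quantized NPU deployment from RyzenAI-SW, and the model ID, prompt wording, and gloss list are examples rather than the exact values we use:

```python
# Illustrative stand-in for the prompting setup using the standard Hugging Face
# transformers API; the actual deployment runs the official 4-bit quantized
# model from RyzenAI-SW on the NPU. Model ID, prompt wording, and gloss list
# are examples, not the exact values used in our system.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

glosses = ["TOMORROW", "CLOUD", "RAIN"]
prompt = (
    "[INST] You are reporting a weather forecast. Form one short English "
    f"sentence from these sign language glosses: {' '.join(glosses)} [/INST]"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)  # 30 tokens keeps latency acceptable
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

Capping the number of newly generated tokens bounds the number of autoregressive decoding steps and therefore the worst-case response time.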
In our second, interactive demo, we showcase our system in a real-world scenario in Video 2. This time, we make use of the end-to-end architecture depicted in Figure 1. We demonstrate the translation use case by signing the words 'rain' and 'cloud', which are correctly identified by the model.
In this way, we show that our toolchain not only works on selected data sets, but can also be transferred to real applications. By using AMD hardware, we can achieve a performance that comes close to real-world requirements.
Weaknesses and Future Work

A major weakness of our sign recognition model is that it was trained only on a constrained domain subset, which does not capture all available sign glosses. In the future, extending the vocabulary size and improving the generalization capabilities of the recognition model are of relevance. Better generalizability could be supported by integrating a human keypoint detector that supports the recognition module by providing more explicit information about the human pose.
Adding a visual-language mapper that injects the visual information into Llama 2 to reduce ambiguities in the language part is another important step towards an automated and reliable system that translates sign language into natural sentences.
On the hardware side, we were not able to run the recognition module on the NPU due to missing support for the three-dimensional convolutional layers required by the video action recognition framework. Using the Riallto framework to implement such custom layers would be a necessary step towards full integration.
One component is still missing to complete our vision of a fully automated sign language communication system: the generation of sign language from spoken language.
Since the code of Neural Sign Actors (V. Baltatzis, 2024) has not been published despite our expectations, we were not able to work on this component. The recently published work on sign language generation by R. Zuo (2024) could serve as an alternative for closing the loop towards an interactive assistant that realizes bi-directional sign language communication.
- Yutong Chen, Fangyun Wei, Xiao Sun, Zhirong Wu, and Stephen Lin. A simple multi-modality transfer learning baseline for sign language translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.
- Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Vasileios Baltatzis, Rolandos Alexandros Potamias, Evangelos Ververas, Guanxiong Sun, Jiankang Deng, and Stefanos Zafeiriou. Neural Sign Actors: A diffusion model for 3D sign language production from text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- Ronglai Zuo, Fangyun Wei, Zenggui Chen, Brian Mak, Jiaolong Yang, and Xin Tong. A simple baseline for spoken language to sign language translation with 3D avatars. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
- Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint, 2023.