According to the World Health Organization (WHO), Indonesia is one of the four countries in Asia with a high prevalence of hearing disorders. Despite the significant number of individuals with hearing impairments, there is a lack of understanding and proficiency in sign language, particularly in critical settings such as hospitals, where communication barriers can severely impact the quality of care. Additionally, the diversity of sign languages worldwide further complicates the situation, underscoring the necessity for specialized sign language interpreters who are fluent in Indonesian Sign Language (BISINDO). This highlights the urgent need for increased awareness, education, and resources to support the deaf community in Indonesia, ensuring they have equal access to essential services and opportunities.
What is BISINDO?
Bisindo, short for Bahasa Isyarat Indonesia, is the Indonesian Sign Language. It serves as a vital means of communication for the deaf and hard-of-hearing community in Indonesia. Similar to other sign languages, Bisindo relies on hand gestures, facial expressions, and body movements to convey meaning and express ideas.
Bisindo has its own distinct grammar and syntax, with a vocabulary that encompasses a wide range of concepts, emotions, and actions. It is used in various settings, including educational institutions, workplaces, and social interactions, enabling deaf individuals to communicate effectively with both other deaf individuals and hearing individuals who have learned the language.
In this project, it's important to note that we will be focusing on a specific problem domain, limiting our scope to the recognition of hand gestures corresponding to letters of the alphabet. This means that our dataset, model architecture, and testing procedures will be tailored specifically to the task of recognizing and classifying hand gestures representing individual letters of the alphabet.
By imposing this limitation, we can streamline the development process and ensure that our resources are effectively utilized towards achieving our research objectives. Additionally, this focused approach allows for a more targeted investigation into the intricacies of hand gesture recognition for alphabet letters, potentially leading to more precise and reliable results.
Solution
The proposed solution, BisindoMate - Indonesian Sign Language Translator, is designed to bridge the communication gap for individuals with hearing impairments in Indonesia. This system leverages AMD Ryzen AI PCs and a custom-built dataset tailored specifically to Indonesian Sign Language (BISINDO). BisindoMate provides real-time translation, facilitating effective communication between deaf individuals and healthcare professionals. For users, this means greater independence and improved access to essential services. Doctors can deliver better care by understanding their patients' needs without language barriers. For the government, implementing BisindoMate enhances public service inclusivity and aligns with health and disability rights initiatives, ultimately promoting a more equitable society.
How It Works
The proposed solution is to create a classification system for Bisindo using hand landmarks. Hand landmarks are specific points on the hand that can be identified and tracked to understand hand movements and gestures. These points typically include key locations such as fingertips, knuckles, and the base of the fingers, which are used in gesture recognition and computer vision technologies to accurately interpret hand positions and motions.
We can collect data in the form of images of hand landmarks for the letters of Bisindo. Then, we can develop a model to classify each letter pattern. In this project, we will use TensorFlow to classify the collected data.
Once a sufficiently accurate model is obtained, the next step is to deploy it. During the deployment process, the model will be converted to ONNX format using Vitis AI so that it can run on the Venus UM790 Pro.
Process and Steps
1. Preparing Venus UM790 Pro
2. Camera Testing
3. Data Collection
4. Data Preparation
5. Training Model
6. Testing Model
7. Quantization and Compilation for Deployment with Vitis AI
8. Deployment
9. Testing
1. Preparing Venus UM790 Pro
The UM790 Pro is a compact yet powerful mini-PC developed by Minisforum. It features AMD’s Ryzen 9 7940HS processor, which offers high performance with efficient power consumption. This mini-PC is designed for versatile use, catering to both professional and casual computing needs. It comes equipped with a robust cooling system to maintain optimal performance and prevent overheating. The UM790 Pro supports multiple connectivity options, including USB-C, HDMI, and DisplayPort, allowing for the connection of various peripherals and multiple monitors. It is also noted for its sleek, modern design and user-friendly setup, making it an excellent choice for those seeking a space-saving yet capable desktop solution.
The first step is to prepare the system. We need to make sure the NPU driver is installed on the mini PC that will be used. Next, install several necessary dependencies, including Visual Studio, CMake, Python, and Anaconda or Miniconda. Finally, install the Ryzen AI Software. For more details, you can refer to the official guide from AMD or watch the video below.
Note:
If you cannot find the AMD IPU Device, you need to enable it first; refer to this document.
If you are using Visual Studio 2022, you need to edit the Visual Studio version check in install.bat, changing
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\Microsoft\VisualStudio\16.0" >nul 2>&1
to
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\Microsoft\VisualStudio\17.0" >nul 2>&1
When all the processes have been completed, we can try running the model in the Conda environment created earlier; in this case, a custom environment name, "rifqi", is used. Below are the results of a quick test conducted on the Venus UM790 Pro with the quicktest.py program: the output "Test Passed" indicates that the model runs on the NPU and that the installation of the Ryzen AI Software was successful.
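As an extra sanity check from Python, you can list the execution providers that ONNX Runtime sees inside the Conda environment. This is only a rough sketch: it assumes the Ryzen AI build of onnxruntime is installed in that environment, and "VitisAIExecutionProvider" is the provider name the Ryzen AI Software registers, so verify it against the version you installed.
# Rough sanity check; assumes it is run inside the Conda environment created
# during the Ryzen AI Software installation, where onnxruntime is installed.
import onnxruntime as ort

providers = ort.get_available_providers()
print(providers)

# The Ryzen AI Software registers a Vitis AI execution provider; if it is
# missing, inference will fall back to the CPU instead of the NPU.
if "VitisAIExecutionProvider" in providers:
    print("NPU execution provider is available.")
else:
    print("Only CPU/other providers found; re-check the installation.")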
2. Camera Testing
Camera testing is a crucial preliminary step in the dataset creation process. First, it is important to verify that the camera connected to the designated hardware, in this case the Venus UM790 Pro mini PC, works correctly. This check ensures that the camera can capture images with adequate clarity and resolution, which is essential for accurate hand landmark detection.
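A minimal camera check can be done with OpenCV before any landmark work. The sketch below simply opens the default webcam and shows live frames until 'q' is pressed; the device index 0 and the window name are assumptions, so adjust them as needed.
import cv2

# Minimal webcam check: open the default camera (index 0 is an assumption)
# and display frames until 'q' is pressed.
cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise RuntimeError("Could not open the camera; check the connection or index.")

while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("Camera test", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()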
3. Data Collection
In this project, the Python programming language is utilized for its versatility and extensive libraries. Specifically, the libraries OpenCV, MediaPipe, and NumPy are employed for efficient data processing. OpenCV serves a crucial role in facilitating various image processing tasks, including image capture, manipulation, and feature extraction. MediaPipe is employed for hand tracking and generating hand landmarks, whereas NumPy is utilized for numerical computations and array manipulation.
Dataset creation begins with the identification of hand landmarks, followed by the creation of a new frame and the overlaying of the detected hand landmarks onto this frame. Subsequently, cropping is performed to ensure that the focus of each captured image is solely on the hand landmarks. To ensure uniformity across all images, a standardized approach is adopted, whereby the background of the frame is set to white, and the dots and connections are rendered in black.
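The sketch below illustrates this idea for a single frame: detect the landmarks with MediaPipe, redraw them in black on a white canvas, and crop around the hand. The file paths, padding, and drawing thicknesses are assumptions rather than the exact values used in the project.
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

# Illustrative single-frame version of the capture step; in practice this runs
# inside the webcam loop. The input path and padding are assumptions.
image = cv2.imread("frame.jpg")
h, w, _ = image.shape

with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    hand = results.multi_hand_landmarks[0]

    # Draw the landmarks and connections in black on a plain white canvas.
    white = np.full((h, w, 3), 255, dtype=np.uint8)
    black = mp_drawing.DrawingSpec(color=(0, 0, 0), thickness=2, circle_radius=2)
    mp_drawing.draw_landmarks(white, hand, mp_hands.HAND_CONNECTIONS, black, black)

    # Crop a padded bounding box around the hand so only the landmarks remain.
    xs = [int(lm.x * w) for lm in hand.landmark]
    ys = [int(lm.y * h) for lm in hand.landmark]
    pad = 20
    x1, y1 = max(min(xs) - pad, 0), max(min(ys) - pad, 0)
    x2, y2 = min(max(xs) + pad, w), min(max(ys) + pad, h)
    crop = white[y1:y2, x1:x2]

    cv2.imwrite("landmark_sample.jpg", crop)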
4. Data Preparation
In this step, we split the data into training and validation sets with an 80:20 ratio: 80% of the hand landmark images are used to train the model, while the remaining 20% are held out for validation. This split lets the model learn from the training data and be evaluated on data it has not seen, giving a more accurate picture of its performance before it is applied to new, unseen data.
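One common way to perform this split with Keras is image_dataset_from_directory, as sketched below. It assumes the cropped landmark images are stored in one folder per letter; the directory name, seed, and batch size are assumptions.
import tensorflow as tf

# Assumption: images are stored as dataset/<letter>/<file>.jpg, one folder per class.
data_dir = "dataset"

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,   # 80:20 split
    subset="training",
    seed=123,
    image_size=(150, 150),
    batch_size=32,
)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,               # same seed so the two subsets do not overlap
    image_size=(150, 150),
    batch_size=32,
)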
5. Training Model
In this project, models are built using TensorFlow Keras, a high-level TensorFlow API that simplifies building and training deep learning models. Keras was chosen for its ease of use and flexibility, making it an excellent choice for both beginners and experienced practitioners. Its user-friendly interface enables rapid prototyping and direct implementation of complex neural networks. Additionally, Keras integrates seamlessly with TensorFlow, providing access to advanced computing resources and optimization techniques, which are essential for developing efficient and robust machine learning models. Using Keras, the project aims to classify images into four different categories, leveraging a series of convolutional and dense layers to achieve high accuracy and performance. The following is the code used, along with an explanation.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

num_classes = 4

model = Sequential([
    # Scale pixel values from 0-255 to 0-1 and define the 150x150 RGB input.
    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(150, 150, 3)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(128, 3, activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes)  # raw logits, one score per class
])
The first step in this model is adding a Rescaling layer, which scales the pixel values from 0-255 to 0-1. This is done to make it easier for the model to learn from the data. This layer also specifies the input shape of the images, which is 150x150 pixels with 3 color channels (RGB).
Next, we add several Convolutional and MaxPooling layers. The Convolutional layers (Conv2D) are used to detect important features in the images like edges, corners, and textures by using 32 filters in the first layer, 64 filters in the second layer, and 128 filters in the third layer, each with a kernel size of 3x3. Each Convolutional layer uses the ReLU activation function, which helps the model handle non-linearity well. MaxPooling2D layers with a size of 2x2 are used after each Convolutional layer to reduce the size of the image, thereby reducing the number of parameters and computations and helping to avoid overfitting.
After passing through a series of Convolutional and MaxPooling layers, we "flatten" the result into one dimension so it can be connected to dense layers. The first dense layer has 128 neurons with the ReLU activation function, which provides the ability to learn more complex feature combinations. Additionally, there is a Dropout layer with a dropout rate of 50% to prevent overfitting by randomly turning off 50% of the neurons during training.
The last layer is a dense layer with the number of neurons equal to the number of classes (4), which will output scores for each class.
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
After building the model, we compile it using the Adam optimizer, which is efficient and widely used. We use the SparseCategoricalCrossentropy loss function, suitable for multi-class classification problems with integer labels, and track the accuracy metric during the training process.
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if logs.get('val_accuracy') > 0.99:
            print("\nValidation accuracy has reached > 99%!")
            self.model.stop_training = True

callbacks = myCallback()
Additionally, we create a custom callback named myCallback to stop training if the validation accuracy exceeds 99%. This helps save time and resources if the model achieves the desired performance before the specified number of epochs is completed.
epochs = 5

history = model.fit(
    train_ds,
    validation_data=val_ds,
    # callbacks=[callbacks],
    epochs=epochs
)
Finally, we train the model with the training data (train_ds) and validation data (val_ds) for 5 epochs, monitoring the validation accuracy. The created callback can be used to stop training early if the desired condition is met, but in this code, the callback is not activated (commented out).
This is how this TensorFlow Keras code works to build and train a model to classify images into four different categories.
6. Testing Model
After the model is built using TensorFlow Keras, the next step is to perform testing to evaluate how well the model performs. The testing is conducted by examining two main parameters: loss and accuracy. Loss indicates the extent of error in the model's predictions compared to the actual values; the smaller the loss value, the better the model performs in making accurate predictions. Accuracy, on the other hand, is the percentage of correct predictions out of the total predictions made by the model; the higher the accuracy value, the better the model is at correctly classifying the data.
There are two main categories in this testing: loss and accuracy for the training data and loss and accuracy for the validation data. Loss and accuracy for the training data indicate how well the model learns from the training data, meaning these values reflect the model's performance on the data used to train it. Meanwhile, loss and accuracy for the validation data show how well the model can generalize its knowledge to new, unseen data. This means these values reflect the model's performance on the data used to test its ability to accurately predict outside the training data.
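One straightforward way to obtain both sets of numbers is Keras' built-in evaluate method. The short sketch below assumes train_ds and val_ds are the datasets created earlier.
# Re-compute loss and accuracy on both splits after training
# (assumes train_ds and val_ds are the datasets built earlier).
train_loss, train_acc = model.evaluate(train_ds, verbose=0)
val_loss, val_acc = model.evaluate(val_ds, verbose=0)

print(f"train loss: {train_loss:.4f} - train accuracy: {train_acc:.4f}")
print(f"val loss:   {val_loss:.4f} - val accuracy:   {val_acc:.4f}")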
From the testing results, we obtained loss: 0.0057 - accuracy: 0.9961 - val_loss: 0.0000e+00 - val_accuracy: 1.0000, indicating that the model has a very low error rate and very high accuracy on the training data. Additionally, the model demonstrates perfect performance on the validation data, with no prediction errors and 100% accuracy. This indicates that the model not only learns well from the training data but also generalizes exceptionally well to unseen data.
We also conducted initial testing using images collected separately (not part of the training or validation data), testing them at random. The results were very promising, with the model correctly classifying all 9 out of 9 images tested.
7. Quantization and Compilation for Deployment with Vitis AI
The model we have built performs very well, so the next step is quantization and compilation for deployment with Vitis AI. In this process, the model is first converted and saved into the ONNX format, using the tf2onnx library available in Python.
Quantization is the process of reducing the precision of the model's weights and activations, which helps in optimizing the model for deployment on hardware with limited resources, such as edge devices. By converting the model to the ONNX format, we ensure compatibility with various platforms and frameworks, making it easier to deploy and integrate into different environments.
Compilation with Vitis AI involves optimizing the model further to leverage the specific capabilities of the target hardware, enhancing performance and efficiency. This step is crucial for deploying deep learning models in real-world applications where computational resources and power consumption are critical factors.
# Convert and save the model to ONNX format
import tensorflow as tf
import tf2onnx

# Define the ONNX conversion: one dynamic batch of 150x150 RGB images
spec = (tf.TensorSpec((None, 150, 150, 3), tf.float32, name="input"),)
model_proto, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)

output_path = "model.onnx"
with open(output_path, "wb") as f:
    f.write(model_proto.SerializeToString())
print(f"Model has been saved to {output_path}")
To test the model, we followed the same steps outlined in the official guide from AMD, with slight modifications to run the program using TensorFlow. We observed that the model performed well in this environment. By adhering to the official guidelines, we ensured that our implementation was robust and aligned with industry standards. The slight adjustments for TensorFlow integration were necessary to accommodate the specific requirements of our project and hardware setup. These modifications included optimizing the TensorFlow runtime and ensuring compatibility with the Vitis AI framework.
Our testing confirmed that the model could execute efficiently and accurately, demonstrating its readiness for deployment. This process not only validated the model's performance but also highlighted its adaptability to different deployment scenarios, ensuring it can be effectively utilized in practical applications.
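Independently of the official test scripts, one quick sanity check is to run the same random input through the Keras model and the exported ONNX model and compare the outputs. The snippet below assumes onnxruntime is available in the environment and that the input tensor is named "input", as in the conversion code above.
import numpy as np
import onnxruntime as ort

# Compare Keras and ONNX Runtime outputs on the same random input.
dummy = np.random.rand(1, 150, 150, 3).astype(np.float32)

keras_logits = model.predict(dummy, verbose=0)

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_logits = sess.run(None, {"input": dummy})[0]

print("max abs difference:", np.abs(keras_logits - onnx_logits).max())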
8. Deployment
The final process is deployment, which combines all the steps carried out previously. First, the MediaPipe Hands module is initialized to detect and track hand landmarks in real time. A directory structure for saving datasets is created if it does not already exist. The ONNX model is loaded and an inference session is started using ONNX Runtime, with the CPU as the execution provider. The webcam is opened to capture live video frames, which are converted to RGB format for processing.
Each frame is processed to detect hands and draw landmarks on both the original and a white background image. If hands are detected, the landmarks are used to calculate a bounding box around the hand, which is then cropped and resized to 150x150 pixels. The cropped image is normalized and passed through the ONNX model to classify the gesture. The predicted gesture, mapped to specific letters, is displayed on the original image.
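The sketch below shows roughly how the pieces described above are set up before the classification snippet that follows; the variable names session and x match that snippet, while the detection confidence, provider choice, and preprocessing helper are assumptions.
import cv2
import mediapipe as mp
import numpy as np
import onnxruntime as ort

# One-time setup: MediaPipe hand tracker and the ONNX Runtime session
# (CPU execution provider, as described above; the Vitis AI provider would
# target the NPU instead).
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7)
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def preprocess(cropped_landmarks_bgr):
    # Resize the cropped white-background landmark image to 150x150, scale it
    # to [0, 1], and add a batch dimension so it matches the model input.
    resized = cv2.resize(cropped_landmarks_bgr, (150, 150))
    return (resized.astype(np.float32) / 255.0)[np.newaxis, ...]

# Inside the webcam loop, `image` is the current frame and `crop` is the
# cropped landmark image produced as in the data collection step:
# x = preprocess(crop)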
# Perform inference with the ONNX model
outputs = session.run(None, {'input': x})  # Adjust input name if necessary
classes = outputs[0][0]
# print(classes)

# Map model outputs to the corresponding letters
predicted_letter = None

# Uncomment and adjust the following lines based on your model's output and thresholds
if classes[0] > 0.60:
    predicted_letter = "F"
elif classes[1] > 0.98:
    predicted_letter = "I"
elif classes[2] > 0.95:
    predicted_letter = "Q"
elif classes[3] > -0.35:
    predicted_letter = "R"

if predicted_letter:
    cv2.putText(image, predicted_letter, (20, 100), cv2.FONT_HERSHEY_SIMPLEX, 3, (0, 0, 255), 5, cv2.LINE_AA)
We set the thresholds as the limits for each class: for class F, a prediction is accepted if the model's output value is greater than 0.60; for class I, the threshold is greater than 0.98; for class Q, greater than 0.95; and for class R, greater than -0.35. Because the final Dense layer outputs raw logits rather than probabilities, the scores are not confined to the 0-1 range, which is why the threshold for R can be negative.
These thresholds are chosen based on model evaluation to ensure high accuracy in classifying each letter. When the model performs inference, the output values are compared against these thresholds to determine the correct letter. If the predictions meet the specified threshold criteria, the corresponding letter is displayed on the image captured from the webcam. The predicted letter is added to the image using the `cv2.putText` function, which places the letter at a specified position in the image with a defined font size and color.
By adjusting these thresholds, we can control the model's sensitivity and specificity, ensuring that the predictions shown truly represent the intended hand gestures. This is a crucial step in optimizing the model's performance in real-world environments and ensuring the reliability of the gesture recognition system.
9. Testing
Closing
BisindoMate is created to translate Indonesian Sign Language (Bisindo). It is designed to help people with hearing impairments communicate better. The developer hopes that BisindoMate can serve as a bridge between the deaf and the general community.
The developer hopes that BisindoMate can be used in many public places in Indonesia. In the future, the developer hopes to add several features, such as expanding translations not only for the alphabet but also for commonly used everyday words. This can be achieved when the system utilizes not only hand landmarks but also body landmarks. The developer also hopes to receive support from the community, government, and related institutions or groups, so that we can increase our concern for the disabled community.
The Indonesian nation has the motto "Bhinneka Tunggal Ika," which means unity in diversity. With this motto, we believe that although Indonesia is diverse in culture, religion, and language, and includes people with physical differences such as disabilities, we will continue to help one another and stand united.
Meet Our Teams!
SEE YOU ON THE NEXT PROJECT! :)