Project GestureME aims to address the limitations and challenges of conventional smart home interfaces, particularly those that rely on voice commands or manual input. While voice-based assistants are popular and capable, they are not always practical, especially in noisy environments or for users with speech impairments.
Here is a gesture-based home assistant system that leverages the advanced capabilities of AMD Ryzen processors. The idea is to build a robust gesture recognition system that accurately detects and interprets user gestures in real time, using accelerated computing to optimize the performance and efficiency of the recognition pipeline, either with standalone models or with multimodal LLMs on AMD Ryzen platforms. By introducing a gesture-based home assistant, GestureME seeks to provide a more natural, intuitive, and accessible way for users to interact with their smart home devices and services than existing solutions.
Need and Benefits
The project addresses the following needs and provides these benefits:
1. Accessibility: It offers an alternative interface that enables users to control their smart home devices using natural gestures and person recognition, improving accessibility for a broader range of users.
2. Privacy: It mitigates growing concerns about privacy and voice data leakage by providing a gesture-based interface that does not rely on constant audio monitoring or wake words, thereby enhancing user privacy and security.
3. User Experience: It also aims to smooth the user experience by combining vision models with hardware acceleration for an intuitive interaction method that responds promptly to user gestures, making control more seamless and enjoyable.
The solution can also be integrated with existing smart home devices and platforms, ensuring compatibility and interoperability. Alongside, it implements customization features that allow users to define and personalize gesture commands according to their preferences and needs. Come on, let's make an AI that responds to you and your sign language, even if another human doesn't ;)
Implementation
1. Hardware and Capabilities
The UM790 Pro hardware used in this project comes with the AMD Ryzen 9 7940HS, a high-performance mobile processor with 8 cores and 16 threads and a maximum frequency of 5.2 GHz. Paired with this, I am using a camera connected over USB to capture the user's gestures. The LLM runs locally on the processor, using the LM Studio utility as the interface.
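As a rough sketch of the capture step, the USB camera can be read with OpenCV; the device index, single-frame handling, and output file name below are my illustrative assumptions, not code taken from the project.

```python
# Minimal sketch: grab one frame of the user's gesture from the USB camera.
# Device index 0 and the output path are assumptions for illustration.
import cv2

cap = cv2.VideoCapture(0)            # first USB camera on the system
if not cap.isOpened():
    raise RuntimeError("Could not open the USB camera")

ok, frame = cap.read()               # capture a single gesture frame
cap.release()

if ok:
    cv2.imwrite("gesture.jpg", frame)  # save the capture for the LLM step
```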
2. LM Studio and using LLMs
LM Studio helps you discover, download, and run local LLMs. These models come in different types: text-only models, such as GPT-3 and GPT-4, can generate text without requiring explicit training data for a specific task, making them versatile across many language processing tasks. Multimodal models combine language processing with other modalities, such as images, audio, or video, enabling applications like visual question answering, image captioning, and more.
Testing
Among the several LLMs available at this time, I've tried a few major ones and found https://huggingface.co/cjpais/llava-1.6-mistral-7b-gguf to be the most reliable for this use case: it gives deterministic results and settles on a definite outcome. Others I tried are:
- https://huggingface.co/xtuner/llava-phi-3-mini-gguf
- https://huggingface.co/nisten/obsidian-3b-multimodal-q6-gguf
which often wander off into unnecessary details and ping-pong between their own questions and answers. LLaVA is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture, documented at https://llava-vl.github.io/
LM Studio allows a downloaded model to be run as a server, and this is how the model is used while developing and testing the project. The server is compatible with the OpenAI API, currently the most popular API for LLMs.
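For reference, a minimal sketch of talking to that local server through the OpenAI Python client could look like the following; the port (1234 is LM Studio's default) and the model identifier are assumptions for illustration.

```python
# Sketch: query the LM Studio local server via the OpenAI-compatible client.
# Port 1234 is LM Studio's default; the model name is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="cjpais/llava-1.6-mistral-7b-gguf",
    messages=[{"role": "user", "content": "Reply with OK if you can hear me."}],
)
print(response.choices[0].message.content)
```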
The camera capture is then fed into the API as an image input to the model, which runs locally on the PC enabled by Ryzen AI, and the results are printed in the same Jupyter terminal. This method is used to validate the project design flow and can later be integrated into standalone software. Currently the latency is also on the higher side, taking ~20 seconds to print the results; this is one of the points I want to improve in this project.
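Putting the pieces together, a hedged sketch of this validation flow (capture a frame, send it as an image input to the locally served LLaVA model, print the reply in the notebook) might look like this; the prompt wording, port, and model name are illustrative assumptions rather than the exact project code.

```python
# Sketch of the validation flow: capture a frame, send it to the local
# LLaVA model over the OpenAI-compatible API, and print the model's reply.
# Prompt text, port 1234, and the model name are illustrative assumptions.
import base64
import cv2
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()
if not ok:
    raise RuntimeError("Camera capture failed")

# Encode the frame as JPEG, then base64, the format expected for image inputs.
_, jpeg = cv2.imencode(".jpg", frame)
b64 = base64.b64encode(jpeg.tobytes()).decode("utf-8")

response = client.chat.completions.create(
    model="cjpais/llava-1.6-mistral-7b-gguf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Which hand gesture is the person showing? Answer in one word."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```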