This project was designed with a specific goal in mind: to assist the visually impaired in their cooking tasks.
The idea came to mind while I was watching an animated movie about a certain talented rat, whose culinary skills helped the lead character cook five-star cuisine.
I thought to myself, why not? Why not build a similar device to help us with our daily cooking tasks? It could be of great use to those with visual impairments who are just getting started with cooking, and it could help make kitchens a more inclusive space for everyone!
Main Features of CHEF-VISION:
- Integration with Bluetooth headsets
- Detection of ingredients, including where they are located and how many there are
- Detection of utensils and cutlery
- Reading of labels on food items
- Knife Safety
- Level of Cooking
The overall system is split into two parts:
1 - The Wi-Fi camera hat
2 - The RPi-microphone system
The user wears the chef's hat, which has a camera integrated into it that connects to the RPi device.
The ESP32 wirelessly streams data (video/images) to the Raspberry Pi.
The user speaks a command, which is picked up by the microphone. The device processes the command, runs the classification needed for the requested service, and sends audio feedback back to the user (either via Bluetooth headphones or a speaker). Here are the features implemented by this device:
1) Ingredient classification - VISION CHEF classifies the ingredients, tells the user which ones are in front of them, and says where they are located (e.g. top-right, left) to make them easy to find.
2) Utensil classification - Same as ingredient classification, but for utensils such as pots, knives, etc.
3) Label reading - This is crucial in cooking: those with limited vision may not be able to tell certain items apart when they come in similar containers with different labels.
4) Knife safety - Alerts the user when a sharp object (in this case, a knife) is detected.
5) Recipe read-out - Reads out recipes either step by step or from a specific step.
6) Determining how well cooked an item is - Classifies whether a cooked item is well done, raw, or overdone!
7) Emergency alerts - Alerts a trusted contact via SMS in case of an emergency.
What's Under the Hood?
Since we are working with compute-intensive machine learning and computer vision models, I have decided to go with a Raspberry Pi 4, as it is perfectly capable of handling all the ML processing as well as the text-to-speech and speech-to-text work.
The Raspberry Pi is housed in a black case with a fan to cool it, and the USB microphone plugs into it. For the purpose of demonstration I'll hook it up to a monitor, but the device doesn't need one, as the program launches on boot.
The Pi can be connected to a wall socket, but you can also run it from a battery pack if you want it to be more portable.
For the camera which livestreams, we are going with the XIAO ESP32 S3 Sense; it is perfectly suited for this task thanks to its extremely compact size and livestreaming capabilities.
The XIAO ESP32 S3 will be connected to a battery and attached to the chef's hat, which is easy to wear. For this demo I'll also be attaching it to a battery pack.
We will also be using the Blues Notecard and Notecarrier for sending emergency messages.
Code Review!
1) Firmware for the XIAO ESP32 S3 Sense
We will be using the CameraWebServer example firmware available in the Arduino IDE, but we have to tweak it a bit:
- Make sure to comment out the Serial.println() statements, as they may block the camera web server when there is no serial port to communicate with; this matters once we connect our XIAO ESP32 S3 Sense to a DC battery.
- Leave Serial.print("Camera Ready! Use 'http://") and Serial.print(WiFi.localIP()) in place, as these don't block the streaming and we need them to read out the device's IP address.
- Connect your XIAO ESP32 to the laptop and flash the firmware.
Testing the XIAO ESP32 S3 Sense camera:
- Open the Serial Monitor; the IP address should be displayed
- Copy and paste this IP into your browser (give it a good 10 seconds to respond)
- You should see the CameraWebServer interface pop up
- Click the start-stream button - you should see a livestream from your device
Your camera module is good to go!
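If you also want to confirm the stream is reachable from the Raspberry Pi itself (not just from a browser), a quick OpenCV check like the sketch below works; the IP address is just an example, so replace it with whatever your Serial Monitor shows (port 81 is the stream port used later in this write-up).

import cv2

stream_url = "http://192.168.1.6:81/stream"  # replace with your ESP32's IP address

cap = cv2.VideoCapture(stream_url)
ok, frame = cap.read()
if ok:
    print("Stream is up, frame size:", frame.shape)
    cv2.imwrite("test_frame.jpg", frame)  # save one frame for a visual check
else:
    print("Could not read from the camera stream, check the IP and port")
cap.release()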
2) Raspberry Pi Code (Python)
(Note: install the libraries and run this code from a virtual environment on the RPi.)
Main.py
- The main control panel! This is where the commands given by the user are processed and the feature functions are called from.
- We use these libraries in this file:
import threading
import speech_recognition as sr
import pyttsx3
threading - we use a mutex to prevent race conditions and blocking by the voice-recognition process (our system would be utterly chaotic without this)
mutex = threading.Lock()
speech_recognition - the engine which converts our speech to text
pyttsx3- text to speech library
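The speak() helper and the recognizer object used in the snippets below aren't shown above, so here is a minimal sketch of how they can be set up with speech_recognition and pyttsx3 (the voice and rate settings are left at their defaults):

recognizer = sr.Recognizer()  # speech_recognition recognizer used by recognize_speech()
engine = pyttsx3.init()       # text-to-speech engine

def speak(text):
    """Read the given text out loud via the Bluetooth headset or speaker."""
    print(text)               # also print it, handy when debugging on a monitor
    engine.say(text)
    engine.runAndWait()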
- Calling all functions (features) from the other files:
from label import label_loop
from ingredient_class import classify_image
from utensil_class import classify_image_u
from knife_safety import process_video
from cooked_level import cook_classify
- recognize_speech() recognizes the command and returns it
def recognize_speech():
    """Capture and recognize speech using the microphone."""
    with sr.Microphone() as source:
        print("Listening...")
        try:
            # listen() is what raises WaitTimeoutError, so it has to sit inside the try block
            audio = recognizer.listen(source, timeout=3, phrase_time_limit=5)
            command = recognizer.recognize_google(audio)
            print(f"You said: {command}")
            return command.lower()
        except sr.WaitTimeoutError:
            print("Listening timed out.")
            return ""
        except sr.UnknownValueError:
            print("Sorry, I did not understand that.")
            return ""
        except sr.RequestError:
            print("Sorry, the service is down.")
            return ""
- The perform_task() function calls the respective feature function based on the command it receives. (You can customize this to add your own functions if you want :))
def perform_task(command):
    """Perform tasks based on recognized speech."""
    with mutex:  # acquire the mutex lock
        if "label" in command:
            label_loop()  # this will block the main loop until it finishes
        elif "vegetable" in command:
            classify_image()
        elif "safety" in command:
            process_video()
        elif "utensil" in command:
            classify_image_u()
        else:
            speak("Sorry, I don't know how to do that.")
- The main function is where the wake word is checked. I've kept it as "hello buddy" for now; you can change it to anything you wish.
def main():
    """Main function to run the voice assistant."""
    while True:
        command = recognize_speech()
        if "hello buddy" in command:
            speak("Hello! What would you like me to help with?")
            # wait for a specific function command after "hello buddy"
            task_command = recognize_speech()
            if task_command:
                perform_task(task_command)
                # optional: speak after the task is completed
                speak("Done! What would you like to do next?")

if __name__ == "__main__":
    main()
This starts the whole process: the assistant asks for a command only after it hears the wake word, kind of like Alexa.
veg_class.py
This file handles image classification of ingredients and works out where they are located.
The models are trained with the help of Edge Impulse. For this project, I took the training photos using my XIAO ESP32 S3 Sense itself, so detection works better because the training camera is the same one we will be using.
We will be using a CNN for image processing; the algorithm used is FOMO (Faster Objects, More Objects).
- We use the cv2 library, pyttsx3 and the edge_impulse_linux library, as our model is trained and deployed with the help of edge impulse
import cv2
import pyttsx3
from edge_impulse_linux.image import ImageImpulseRunner
- Make sure to update your camera_url (your ESP32 IP address) and model path (wherever the model is saved on the RPi).
camera_url = "http://192.xxx.x.xx:<port_no>/stream"
model_path = "/home/devicename/your_model_name"
- The function does the following (see the sketch below):
1. Captures a frame from the camera
2. Runs classification and determines the bounding boxes
3. Outputs text-to-speech saying which ingredient was detected and where it is located
4. Works out the location from the bounding-box coordinates
(Initially I wanted to do a continuous livestream, but since we need to save resources on our ESP32 processor and not push it to its limits, I stuck to capturing single images.)
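Since the full file is a bit long, here is a trimmed-down sketch of what classify_image() boils down to, building on the imports, camera_url and model_path shown above. The 3x3 grid mapping from bounding-box coordinates to phrases like "top right" is just one simple way to do it; tune it to taste.

engine = pyttsx3.init()

def describe_location(cx, cy, width, height):
    """Map a bounding-box centre onto a spoken 3x3 grid position."""
    col = "left" if cx < width / 3 else ("right" if cx > 2 * width / 3 else "centre")
    row = "top" if cy < height / 3 else ("bottom" if cy > 2 * height / 3 else "middle")
    return f"{row} {col}"

def classify_image():
    """Grab one frame from the ESP32 stream, classify it, and announce the results."""
    cap = cv2.VideoCapture(camera_url)
    ret, frame = cap.read()
    cap.release()
    if not ret:
        engine.say("I could not reach the camera.")
        engine.runAndWait()
        return

    with ImageImpulseRunner(model_path) as runner:
        runner.init()
        # the Edge Impulse runner expects RGB, while OpenCV captures BGR
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        features, cropped = runner.get_features_from_image(rgb)
        result = runner.classify(features)

    boxes = result["result"].get("bounding_boxes", [])
    if not boxes:
        engine.say("I could not find any ingredients.")
    else:
        height, width = cropped.shape[:2]
        for box in boxes:
            cx = box["x"] + box["width"] / 2
            cy = box["y"] + box["height"] / 2
            engine.say(f"I see {box['label']} at the {describe_location(cx, cy, width, height)}")
    engine.runAndWait()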
Utensil_class.py
Similar to the ingredient classification, just with utensils. We have to load a different model that has been trained to classify utensils; the code remains mostly the same, and the only thing that changes is the model path.
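So the only real change is pointing the Edge Impulse runner at the utensil model; the path below is just a placeholder for wherever you saved it on the RPi.

utensil_model_path = "/home/devicename/your_utensil_model_name"  # placeholder path

def classify_image_u():
    """Same capture-and-classify flow as classify_image(), just with the utensil model."""
    with ImageImpulseRunner(utensil_model_path) as runner:
        runner.init()
        # ...capture a frame, extract features, classify and speak the results, exactly as before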
Label.py
This file handles reading the labels on certain containers, which is taken care of by the Tesseract library, an optical character recognition tool that converts the text in an image into a string, which can then be turned into speech. On the RPi we can import it as pytesseract.
import cv2
import pytesseract
import pyttsx3
""" Initialize the text-to-speech engine """
engine = pyttsx3.init()
url = "http://192.168.1.6:81/stream"
cap = cv2.VideoCapture(url)
# List of target words to search for
target_words = ["SALT", "PEPPER", "SAUCE", "FRUIT", "MIXED", "Mayo"]
Make sure to adjust and implement your own target words: Tesseract may read and spit out some nonsense characters, so we ask it to search only for a few commonly used target words to avoid any confusion. The algorithm for this is simple (see the sketch after the list):
1. Capture a frame from the camera
2. Convert it to grayscale
3. Apply Tesseract OCR and extract the string
4. Search for the keywords in the extracted string
5. If found, read the label out to the user
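Here is a single-pass sketch of label_loop(), building on the imports, stream URL and target_words defined above (the real function can simply repeat this in a loop; the spoken fallback phrases are just examples):

def label_loop():
    """Grab a frame, OCR it, and read out any target word that shows up."""
    # 1. Capture a frame from the camera stream
    ret, frame = cap.read()
    if not ret:
        engine.say("I could not reach the camera.")
        engine.runAndWait()
        return

    # 2. Convert it to grayscale so Tesseract has an easier time
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # 3. Apply Tesseract OCR and extract the string
    text = pytesseract.image_to_string(gray)

    # 4. Search for the keywords in the extracted string
    found = [word for word in target_words if word.upper() in text.upper()]

    # 5. If found, read the label out to the user
    if found:
        engine.say("I can see " + " and ".join(found))
    else:
        engine.say("I could not read any labels.")
    engine.runAndWait()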
(Note: I've observed that bad lighting may not produce optimal results; perhaps future versions of VISION CHEF could add a lighting system to avoid this.)
Knife_safety.py
This code uses two models: one trained in Edge Impulse to detect a knife, and another which uses the MediaPipe model for hand detection. The program is very straightforward: if the bounding boxes of the hand and the knife overlap a bit too much, it plays a warning alert.
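The heart of process_video() is deciding when the knife box (from the Edge Impulse model) and the hand box (from MediaPipe) overlap too much. Here is a minimal sketch of that check, assuming both boxes arrive as (x, y, width, height) tuples; the 30% threshold is an arbitrary starting point.

def boxes_overlap_too_much(hand_box, knife_box, threshold=0.3):
    """True if the intersection covers more than `threshold` of the smaller box."""
    hx, hy, hw, hh = hand_box
    kx, ky, kw, kh = knife_box

    # intersection rectangle of the two boxes
    ix = max(hx, kx)
    iy = max(hy, ky)
    iw = min(hx + hw, kx + kw) - ix
    ih = min(hy + hh, ky + kh) - iy
    if iw <= 0 or ih <= 0:
        return False  # the boxes don't touch at all

    inter_area = iw * ih
    smaller_area = min(hw * hh, kw * kh)
    return inter_area / smaller_area > threshold

process_video() can then play the warning alert whenever this returns True for a detected hand/knife pair.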
This is the part of the code that stores recipes and, when called, reads them out from the beginning, from the step you want, or just one step alone. It could be a big help for those who cannot read from a recipe book.
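The recipe file itself isn't shown here, so the snippet below is only an illustration of the idea: recipes stored as plain lists of steps and read out with the same speak() helper as in Main.py, either in full, from a chosen step, or one step at a time (the recipe text is made up for the example).

# hypothetical recipe store: each recipe is just an ordered list of spoken steps
recipes = {
    "omelette": [
        "Crack two eggs into a bowl.",
        "Whisk the eggs with a pinch of salt.",
        "Pour the mixture into a hot, oiled pan.",
        "Cook until set, then fold and serve.",
    ],
}

def read_recipe(name, start_step=1, only_one=False):
    """Read a recipe from start_step onwards, or just that single step."""
    steps = recipes.get(name.lower())
    if not steps:
        speak(f"Sorry, I don't have a recipe for {name}.")
        return
    for number, step in enumerate(steps[start_step - 1:], start=start_step):
        speak(f"Step {number}. {step}")
        if only_one:
            break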
Cooked_level.py
This is used to determine how well cooked the food is. For this example I have used an omelette. This model is trained in Google's Teachable Machine. It's very easy to get around and understand.
Download the model and run it in the code. It should output whether your egg is cooked well or not.
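For reference, this is roughly what cook_classify() boils down to if you use the Keras export from Teachable Machine; the file names keras_model.h5 and labels.txt, the 224x224 input size and the pixel scaling follow its export template, while the label names and the frame argument are placeholders.

import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("keras_model.h5")  # Teachable Machine Keras export
# labels.txt has lines like "0 raw", "1 well done", "2 overdone"
labels = [line.strip().split(" ", 1)[1] for line in open("labels.txt") if line.strip()]

def cook_classify(frame):
    """Return the doneness label for a single captured frame."""
    img = cv2.resize(frame, (224, 224))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    data = (img.astype(np.float32) / 127.5) - 1.0  # scale pixels to [-1, 1]
    prediction = model.predict(data[np.newaxis, ...])
    return labels[int(np.argmax(prediction))]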
This is used in case of emergencies, if the user needs assistance from a trusted contact after a kitchen-related accident. The Raspberry Pi, with the help of the Blues Notecard, sends an alert message to the recipient's number.
To set up the Twilio SMS alert system, you can refer to this. Once you receive the Twilio number, modify this JSON and send it over the serial port to the Notecard. To send emergency notifications, make sure your Notecard is connected to the RPi.
twilio_request = {
    "req": "note.add",
    "file": "twilio.qo",
    "sync": True,
    "body": {
        "customMessage": "This is an emergency!",  # message text for the SMS
        "customTo": "your phone number",           # replace with your phone number
        "customFrom": "+twilio phone number"       # replace with your Twilio number
    }
}
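If you'd rather drive the Notecard from Python than type the JSON into a serial terminal, the note-python library can send the same request. This is a sketch: it assumes the Notecard is already paired with your Notehub project and the Twilio route from the guide above, and that it shows up as /dev/ttyACM0 on the Pi.

import serial
import notecard

port = serial.Serial("/dev/ttyACM0", 9600)  # device name depends on your setup
card = notecard.OpenSerial(port)

def send_emergency_alert():
    """Push the emergency note to Notehub, which forwards it to Twilio as an SMS."""
    response = card.Transaction(twilio_request)
    print(response)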
Results
(Note that we do not need any displays for the outcome; they are only there for our debugging purposes.)
We are also able to receive emergency messages thanks to the Blues Notecard. Thus we are able to verify and confirm that our device is working!
Future Developments and Conclusion
Hopefully, this prototype has laid the foundation for far better products. It would be ideal to have specialized hardware and the whole system integrated into the hat, and the accuracy of the ML models could still be improved a lot.