This project was designed with a specific goal in mind: to assist the visually impaired in their cooking tasks.
The idea came to mind while I was watching an animated movie about a certain talented rat, whose culinary skills helped the lead character cook five-star cuisine.
I thought to myself, why not? Why not build a similar device to help us with our daily cooking tasks? It could be of great use to those with visual impairments who are just getting started with cooking, and it could help make kitchens a more inclusive space for everyone!
Main Features of CHEF-VISION:
- Integration with Bluetooth headsets
- Detection of ingredients, including where they are located and how many there are
- Detection of utensils and cutlery
- Reading of labels on food items
- Knife Safety
- Level of Cooking
The overall system is split into two parts:
1 - The Wi-Fi camera hat
2 - The RPi-microphone system
The user wears the chef's hat, which has a camera integrated into it that connects to the RPi device.
The ESP32 wirelessly streams data (video/images) to the Raspberry Pi.
The user speaks a command, which is picked up by the microphone. The device processes the command, runs the classification needed for the requested service, and sends audio feedback back to the user (either via Bluetooth headphones or a speaker). Here are the features implemented by this device:
1) Ingredient classification - VISION CHEF classifies the ingredients, tells the user which ones are in front of them, and says where they are located (e.g. top-right, left) to make them easy to find.
2) Utensil classification - Same as ingredient classification, but for utensils such as pots, knives, etc.
3) Label reading - This is crucial in cooking: those with limited vision may not be able to tell certain items apart when they come in similar containers with different labels.
4) Knife safety - Alerts the user when a sharp object (in this case, a knife) is detected.
5) Recipe read-out - Reads out recipes either step by step or from a specific step.
6) Determining how well cooked an item is - Classifies whether a cooked item is well done, raw, or overdone!
7) Emergency alerts - Alerts a trusted contact via SMS in case of an emergency.
What's Under the Hood?
Since we are working with compute-intensive machine learning and computer vision models, I have decided to go with a Raspberry Pi 4, as it is perfectly capable of handling all the ML processing as well as the text-to-speech and speech-to-text work.
The Raspberry Pi is housed in a black case with a fan to cool it, and the USB microphone plugs into it. For the purpose of demonstration I'll hook it up to a monitor, but the device doesn't need one, as the program launches on boot.
The Pi can be connected to a wall socket, but you can also run it from a battery pack if you want it to be more portable.
For the camera which livestreams, we are going with the XIAO ESP32 S3 Sense; it is perfectly suited for this task thanks to its extremely compact size and livestreaming capabilities.
The XIAO ESP32 S3 will be connected to a battery and attached to the chef's hat, which is easy to wear. For this demo I'll also be attaching it to a battery pack.
We will also be using the Blues Notecard and Notecarrier for sending emergency messages.
Code Review!
1) Firmware for the XIAO ESP32 S3 Sense
We will be using the CameraWebServer example firmware available in the Arduino IDE, but we have to tweak it a bit:
- Make sure to comment out the Serial.println() statements, as they may block the camera web server when there is no serial port to communicate with; this matters once we connect our XIAO ESP32 S3 Sense to a DC battery.
- Leave Serial.print("Camera Ready! Use 'http://") and Serial.print(WiFi.localIP()) in place, as these don't block the streaming and we need them to read out the device's IP address.
- Connect your XIAO ESP32 to the laptop and flash the firmware.
Testing the XIAO ESP32 S3 Sense camera:
- Open the Serial Monitor; the IP address should be displayed
- Copy and paste this IP into your browser (give it a good 10 seconds to respond)
- You should see the CameraWebServer interface pop up
- Click the start-stream button - you should see a livestream from your device
Your camera module is good to go!
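If you also want to confirm the stream is reachable from the Raspberry Pi itself (not just from a browser), a quick OpenCV check like the sketch below works; the IP address is just an example, so replace it with whatever your Serial Monitor shows (port 81 is the stream port used later in this write-up).

import cv2

stream_url = "http://192.168.1.6:81/stream"  # replace with your ESP32's IP address

cap = cv2.VideoCapture(stream_url)
ok, frame = cap.read()
if ok:
    print("Stream is up, frame size:", frame.shape)
    cv2.imwrite("test_frame.jpg", frame)  # save one frame for a visual check
else:
    print("Could not read from the camera stream, check the IP and port")
cap.release()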
2) Raspberry Pi Code (Python)
(Note: install the libraries and run this code from a virtual environment on the RPi.)
Main.py
- The main control panel! This is where the commands given by the user are processed and the feature functions are called from.
- We use these libraries in this file:
import threading
import speech_recognition as sr
import pyttsx3
threading - we use a mutex to prevent race conditions and blocking by the voice-recognition process (our system would be utterly chaotic without this)
mutex = threading.Lock()
speech_recognition - the engine which converts our speech to text
pyttsx3- text to speech library
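The speak() helper and the recognizer object used in the snippets below aren't shown above, so here is a minimal sketch of how they can be set up with speech_recognition and pyttsx3 (the voice and rate settings are left at their defaults):

recognizer = sr.Recognizer()  # speech_recognition recognizer used by recognize_speech()
engine = pyttsx3.init()       # text-to-speech engine

def speak(text):
    """Read the given text out loud via the Bluetooth headset or speaker."""
    print(text)               # also print it, handy when debugging on a monitor
    engine.say(text)
    engine.runAndWait()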
- Calling all functions (features) from the other files:
from label import label_loop
from ingredient_class import classify_image
from utensil_class import classify_image_u
from knife_safety import process_video
from cooked_level import cook_classify
- recognize_speech() recognizes the command and returns it
def recognize_speech():
    """Capture and recognize speech using the microphone."""
    with sr.Microphone() as source:
        print("Listening...")
        try:
            # listen() is what raises WaitTimeoutError, so it has to sit inside the try block
            audio = recognizer.listen(source, timeout=3, phrase_time_limit=5)
            command = recognizer.recognize_google(audio)
            print(f"You said: {command}")
            return command.lower()
        except sr.WaitTimeoutError:
            print("Listening timed out.")
            return ""
        except sr.UnknownValueError:
            print("Sorry, I did not understand that.")
            return ""
        except sr.RequestError:
            print("Sorry, the service is down.")
            return ""
- The perform_task() function calls the respective feature function based on the command it receives. (You can customize this to add your own functions if you want :))
def perform_task(command):
    """Perform tasks based on recognized speech."""
    with mutex:  # acquire the mutex lock
        if "label" in command:
            label_loop()  # this will block the main loop until it finishes
        elif "vegetable" in command:
            classify_image()
        elif "safety" in command:
            process_video()
        elif "utensil" in command:
            classify_image_u()
        else:
            speak("Sorry, I don't know how to do that.")
- The main function is where the wake word is checked. I've kept it as "hello buddy" for now; you can change it to anything you wish.
def main():
    """Main function to run the voice assistant."""
    while True:
        command = recognize_speech()
        if "hello buddy" in command:
            speak("Hello! What would you like me to help with?")
            # wait for a specific function command after "hello buddy"
            task_command = recognize_speech()
            if task_command:
                perform_task(task_command)
                # optional: speak after the task is completed
                speak("Done! What would you like to do next?")

if __name__ == "__main__":
    main()
This starts the whole process: the assistant asks for a command only after it hears the wake word, kind of like Alexa.
veg_class.py
This file handles image classification of ingredients and works out where they are located.
The models are trained with the help of Edge Impulse. For this project, I took the training photos using my XIAO ESP32 S3 Sense itself, so detection works better because the training camera is the same one we will be using.
We will be using a CNN for image processing; the algorithm used is FOMO (Faster Objects, More Objects).
- We use the cv2 library, pyttsx3 and the edge_impulse_linux library, as our model is trained and deployed with the help of edge impulse
import cv2
import pyttsx3
from edge_impulse_linux.image import ImageImpulseRunner
- Make sure to update your camera_url (your ESP32 IP address) and model path (wherever the model is saved on the RPi).
camera_url = "http://192.xxx.x.xx:<port_no>/stream"
model_path = "/home/devicename/your_model_name"
- The function does the following (see the sketch below):
1. Captures a frame from the camera
2. Runs classification and determines the bounding boxes
3. Outputs text-to-speech saying which ingredient was detected and where it is located
4. Works out the location from the bounding-box coordinates
(Initially I wanted to do a continuous livestream, but since we need to save resources on our ESP32 processor and not push it to its limits, I stuck to capturing single images.)
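Since the full file is a bit long, here is a trimmed-down sketch of what classify_image() boils down to, building on the imports, camera_url and model_path shown above. The 3x3 grid mapping from bounding-box coordinates to phrases like "top right" is just one simple way to do it; tune it to taste.

engine = pyttsx3.init()

def describe_location(cx, cy, width, height):
    """Map a bounding-box centre onto a spoken 3x3 grid position."""
    col = "left" if cx < width / 3 else ("right" if cx > 2 * width / 3 else "centre")
    row = "top" if cy < height / 3 else ("bottom" if cy > 2 * height / 3 else "middle")
    return f"{row} {col}"

def classify_image():
    """Grab one frame from the ESP32 stream, classify it, and announce the results."""
    cap = cv2.VideoCapture(camera_url)
    ret, frame = cap.read()
    cap.release()
    if not ret:
        engine.say("I could not reach the camera.")
        engine.runAndWait()
        return

    with ImageImpulseRunner(model_path) as runner:
        runner.init()
        # the Edge Impulse runner expects RGB, while OpenCV captures BGR
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        features, cropped = runner.get_features_from_image(rgb)
        result = runner.classify(features)

    boxes = result["result"].get("bounding_boxes", [])
    if not boxes:
        engine.say("I could not find any ingredients.")
    else:
        height, width = cropped.shape[:2]
        for box in boxes:
            cx = box["x"] + box["width"] / 2
            cy = box["y"] + box["height"] / 2
            engine.say(f"I see {box['label']} at the {describe_location(cx, cy, width, height)}")
    engine.runAndWait()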
Utensil_class.py
Similar to the ingredient classification, just with utensils. We have to load a different model that has been trained to classify utensils; the code remains mostly the same, and the only thing that changes is the model path.
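So the only real change is pointing the Edge Impulse runner at the utensil model; the path below is just a placeholder for wherever you saved it on the RPi.

utensil_model_path = "/home/devicename/your_utensil_model_name"  # placeholder path

def classify_image_u():
    """Same capture-and-classify flow as classify_image(), just with the utensil model."""
    with ImageImpulseRunner(utensil_model_path) as runner:
        runner.init()
        # ...capture a frame, extract features, classify and speak the results, exactly as before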
Label.py
This file handles reading the labels on certain containers, which is taken care of by the Tesseract library, an optical character recognition tool that converts the text in an image into a string, which can then be turned into speech. On the RPi we can import it as pytesseract.
import cv2
import pytesseract
import pyttsx3
""" Initialize the text-to-speech engine """
engine = pyttsx3.init()
url = "http://192.168.1.6:81/stream"
cap = cv2.VideoCapture(url)
# List of target words to search for
target_words = ["SALT", "PEPPER", "SAUCE", "FRUIT", "MIXED", "Mayo"]
Make sure to adjust and implement your own target words: Tesseract may read and spit out some nonsense characters, so we ask it to search only for a few commonly used target words to avoid any confusion. The algorithm for this is simple (see the sketch after the list):
1. Capture a frame from the camera
2. Convert it to grayscale
3. Apply Tesseract OCR and extract the string
4. Search for the keywords in the extracted string
5. If found, read the label out to the user
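Here is a single-pass sketch of label_loop(), building on the imports, stream URL and target_words defined above (the real function can simply repeat this in a loop; the spoken fallback phrases are just examples):

def label_loop():
    """Grab a frame, OCR it, and read out any target word that shows up."""
    # 1. Capture a frame from the camera stream
    ret, frame = cap.read()
    if not ret:
        engine.say("I could not reach the camera.")
        engine.runAndWait()
        return

    # 2. Convert it to grayscale so Tesseract has an easier time
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # 3. Apply Tesseract OCR and extract the string
    text = pytesseract.image_to_string(gray)

    # 4. Search for the keywords in the extracted string
    found = [word for word in target_words if word.upper() in text.upper()]

    # 5. If found, read the label out to the user
    if found:
        engine.say("I can see " + " and ".join(found))
    else:
        engine.say("I could not read any labels.")
    engine.runAndWait()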
(Note: I've observed that bad lighting may not produce optimal results; perhaps future versions of VISION CHEF could add a lighting system to avoid this.)
Knife_safety.py
This code uses two models: one trained in Edge Impulse to detect a knife, and another which uses the MediaPipe model for hand detection. The program is very straightforward: if the bounding boxes of the hand and the knife overlap a bit too much, it plays a warning alert.
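The heart of process_video() is deciding when the knife box (from the Edge Impulse model) and the hand box (from MediaPipe) overlap too much. Here is a minimal sketch of that check, assuming both boxes arrive as (x, y, width, height) tuples; the 30% threshold is an arbitrary starting point.

def boxes_overlap_too_much(hand_box, knife_box, threshold=0.3):
    """True if the intersection covers more than `threshold` of the smaller box."""
    hx, hy, hw, hh = hand_box
    kx, ky, kw, kh = knife_box

    # intersection rectangle of the two boxes
    ix = max(hx, kx)
    iy = max(hy, ky)
    iw = min(hx + hw, kx + kw) - ix
    ih = min(hy + hh, ky + kh) - iy
    if iw <= 0 or ih <= 0:
        return False  # the boxes don't touch at all

    inter_area = iw * ih
    smaller_area = min(hw * hh, kw * kh)
    return inter_area / smaller_area > threshold

process_video() can then play the warning alert whenever this returns True for a detected hand/knife pair.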
This is the part of the code that stores recipes and, when called, reads them out from the beginning, from the step you want, or just one step alone. It could be a big help for those who cannot read from a recipe book.
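The recipe file itself isn't shown here, so the snippet below is only an illustration of the idea: recipes stored as plain lists of steps and read out with the same speak() helper as in Main.py, either in full, from a chosen step, or one step at a time (the recipe text is made up for the example).

# hypothetical recipe store: each recipe is just an ordered list of spoken steps
recipes = {
    "omelette": [
        "Crack two eggs into a bowl.",
        "Whisk the eggs with a pinch of salt.",
        "Pour the mixture into a hot, oiled pan.",
        "Cook until set, then fold and serve.",
    ],
}

def read_recipe(name, start_step=1, only_one=False):
    """Read a recipe from start_step onwards, or just that single step."""
    steps = recipes.get(name.lower())
    if not steps:
        speak(f"Sorry, I don't have a recipe for {name}.")
        return
    for number, step in enumerate(steps[start_step - 1:], start=start_step):
        speak(f"Step {number}. {step}")
        if only_one:
            break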
Cooked_level.py
This is used to determine how well cooked the food is. For this example I have used an omelette. This model is trained in Google's Teachable Machine. It's very easy to get around and understand.
Download the model and run it in the code. It should output whether your egg is cooked well or not.
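For reference, this is roughly what cook_classify() boils down to if you use the Keras export from Teachable Machine; the file names keras_model.h5 and labels.txt, the 224x224 input size and the pixel scaling follow its export template, while the label names and the frame argument are placeholders.

import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("keras_model.h5")  # Teachable Machine Keras export
# labels.txt has lines like "0 raw", "1 well done", "2 overdone"
labels = [line.strip().split(" ", 1)[1] for line in open("labels.txt") if line.strip()]

def cook_classify(frame):
    """Return the doneness label for a single captured frame."""
    img = cv2.resize(frame, (224, 224))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    data = (img.astype(np.float32) / 127.5) - 1.0  # scale pixels to [-1, 1]
    prediction = model.predict(data[np.newaxis, ...])
    return labels[int(np.argmax(prediction))]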
This is used in case of emergencies, if the user needs assistance from a trusted contact after a kitchen-related accident. The Raspberry Pi, with the help of the Blues Notecard, sends an alert message to the recipient's number.
To set up the Twilio SMS alert system, you can refer to this. Once you receive the Twilio number, modify this JSON and send it over the serial port to the Notecard. To send emergency notifications, make sure your Notecard is connected to the RPi.
twilio_request = {
    "req": "note.add",
    "file": "twilio.qo",
    "sync": True,
    "body": {
        "customMessage": "This is an emergency!",  # message text for the SMS
        "customTo": "your phone number",           # replace with your phone number
        "customFrom": "+twilio phone number"       # replace with your Twilio number
    }
}
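If you'd rather drive the Notecard from Python than type the JSON into a serial terminal, the note-python library can send the same request. This is a sketch: it assumes the Notecard is already paired with your Notehub project and the Twilio route from the guide above, and that it shows up as /dev/ttyACM0 on the Pi.

import serial
import notecard

port = serial.Serial("/dev/ttyACM0", 9600)  # device name depends on your setup
card = notecard.OpenSerial(port)

def send_emergency_alert():
    """Push the emergency note to Notehub, which forwards it to Twilio as an SMS."""
    response = card.Transaction(twilio_request)
    print(response)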
Results
(Note that we do not need any displays for the outcome; they are only there for our debugging purposes.)
We are also able to receive emergency messages thanks to the Blues Notecard. Thus we are able to verify and confirm that our device is working!
Future Developments and Conclusion
Hopefully, this prototype has laid the foundation for far better products. It would be ideal to have specialized hardware and the whole system integrated into the hat, and the accuracy of the ML models could still be improved a lot.