The ability to feed oneself is a crucial aspect of daily living, and losing this ability can have a significant impact on a person's life. Robotics has the potential to assist with this task.
The development of a successful robotic assistive feeding system depends on its ability to acquire bites of food accurately, time the bites appropriately, and transfer the bites easily in both individual and group dining settings. However, the task of automating bite acquisition is challenging due to the vast variety of foods, utensils, and human strategies involved, as well as the need for robust manipulation of a deformable and hard-to-model target.
Bite timing, especially in social dining settings, requires a delicate balance of multiple forms of communication, such as gaze, facial expressions, gestures, and speech, as well as action and sometimes coercion. Bite transfer is a unique form of robot-human handover that involves the use of the mouth.
Gesture and speech controlled robot arms are being developed to assist in feeding individuals with upper-extremity mobility limitations.
These robot arms use sensors and cameras to detect and interpret hand gestures or voice commands, allowing the user to control the robot arm in a more natural way. Gesture controlled robot arms can pick up food, bring it to the user's mouth, and adjust the position of the food based on the user's gestures.
Speech controlled robot arms can interpret voice commands such as "open mouth" or "move food closer". These robot arms have the potential to improve the independence and quality of life for individuals with upper-extremity mobility limitations and can also be used in healthcare and elderly care settings to improve the efficiency and safety of feeding assistance.
In this project, we focus on face tracking, gesture recognition, robot-assisted feeding, and object detection in cluttered environments.
Sustainable Development Goals: This project aims to help countries progress toward the 11th Sustainable Development Goal by introducing technologies that make human societies more inclusive.
Hardware
Building the robot arm
You will need the LEGO Mindstorms 51515 kit. We followed this video by PhiloYT to create the arm. You can find the building instructions here.
Instead of attaching the Mindstorms Hub, attach the Raspberry Pi and the BuildHat. At the end it should look like this:
1) OpenCV Spatial AI Kit (OAK-D):
The OAK-D is a device that fuses monocular neural inference with stereo depth and can also run stereo neural inference.
DepthAI is the embedded spatial AI platform built around the Myriad X: a complete ecosystem of custom hardware, firmware, software, and AI training. It combines neural inference, depth vision, and feature tracking into an easy-to-use, works-in-30-seconds solution.
DepthAI offloads AI, depth vision, and more, processed directly from the built-in cameras, freeing your host to handle application-specific data.
DepthAI gives you the power of AI, depth, and tracking in a single device, with a simple, easy-to-use API written in Python and C++.
Best of all, it is modular (System on Module) and built on MIT-licensed open-source hardware, making it easy to add these spatial AI/CV capabilities to real commercial products.
The Geometry behind Calculation of Depth and Disparity from Stereo Images on OAK-D:
By tracking the displacement of points between the two images of a stereo pair, the distance of those points from the camera can be determined. The differing disparities between corresponding points in the two images are the result of parallax: when a point in the scene is projected onto the image planes of two horizontally displaced cameras, the closer the point is to the camera baseline, the larger the difference in its relative location on the two image planes. Stereo matching aims to identify the corresponding points and retrieve their displacement to reconstruct the geometry of the scene as a depth map.
A similar approach drives the software of the OAK-D device, developed by the Luxonis team. Here's a small example demonstrating how it works:
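As a toy illustration of the disparity-to-depth relationship described above (this is not the OAK-D firmware; the ~75 mm baseline and ~440 px focal length used here are assumed, approximate OAK-D values), a few lines of Python show how larger disparities correspond to closer points:
# Toy disparity-to-depth conversion: depth = focal_length_px * baseline / disparity
def depth_mm(disparity_px, focal_px=440.0, baseline_mm=75.0):
    if disparity_px <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_px * baseline_mm / disparity_px

for d in (2, 10, 40, 80):
    print(f"disparity {d:2d} px -> depth {depth_mm(d):7.0f} mm")
The output makes the inverse relationship obvious: doubling the disparity halves the estimated depth.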
2) Build HAT:
The Raspberry Pi Build HAT connects to any Raspberry Pi with a 40-pin GPIO header and lets you control up to four LEGO Technic motors and sensors from the LEGO Education SPIKE Portfolio, opening up a wide range of possibilities for building advanced, intelligent machines that combine Raspberry Pi computing power with LEGO components.
Connections:
The same wires that are provided with the LEGO Robot Inventor kit can be used to connect the LEGO motors to the Build HAT.
- Gripper motor to Port A
- Gripper control motor to Port B
- Shoulder control motor to Port C
- Body Movement motor to Port D
This is the movement code:
import time
from signal import pause
from buildhat import Motor
motor_gripper = Motor('A')
motor_gripper_control = Motor('B')
motor_body_control = Motor('C')
motor_body_movement = Motor('D')
print("Position Motor Gripper Port A", motor_gripper.get_aposition())
print("Position Motor Gripper Control Port B", motor_gripper.get_aposition())
print("Position Motor Body Control Port C", motor_gripper.get_aposition())
print("Position Motor Body Movement Port D", motor_gripper.get_aposition())
TurnDegrees1 = 85
x = TurnDegrees1 / 22.5   # convert the desired base turn (degrees) into motor rotations (roughly 22.5 degrees of arm rotation per motor rotation)
TurnDegrees2 = 75
x2 = TurnDegrees2 / 22.5  # rotations for the return move
motor_body_control.run_to_position(70, speed=50, direction='anticlockwise')
motor_gripper_control.run_to_position(75, speed=30, direction='clockwise')
motor_body_movement.run_for_rotations(x, speed=60)
motor_gripper_control.run_to_position(-80, speed=20, direction='anticlockwise')
motor_body_control.run_to_position(-30, speed=20, direction='clockwise')
time.sleep(0.25)
motor_gripper.run_to_position(140, speed=30, direction='clockwise')
motor_gripper_control.run_to_position(70, speed=30, direction='clockwise')
motor_body_control.run_to_position(70, speed=50, direction='anticlockwise')
motor_gripper_control.run_to_position(75, speed=30, direction='clockwise')
time.sleep(0.25)
motor_body_movement.run_for_rotations(x2, speed=-60)
motor_body_control.run_to_position(-14, speed=20, direction='clockwise')
motor_gripper_control.run_to_position(-4, speed=20, direction='anticlockwise')
motor_gripper.run_to_position(13, speed=60, direction='anticlockwise')
motor_gripper_control.run_to_position(70, speed=30, direction='clockwise')
Running this code will make your robot arm pick up a block on the ground and place it somewhere else. You can try playing around with the parameters to change where the block is placed and how quickly the arm moves.
Troubleshooting the robot arm movement:
Sometimes the motors introduce errors when moving to a commanded angle. You can read more about the issue here. To reduce this error, run the code below to find the average error, then adjust the base motor's angular movement using the average error reading to increase the accuracy of the spatial movement.
#!/usr/bin/python3.9
import time
from buildhat import Motor
m = Motor('D')
m.bias(0.4)
sum_of_angle_differences = 0
angs = [-180, 180, 90, -90, 45, -45, 0] * 2
for i in angs:
    m.run_to_position(i)
    time.sleep(1)
    pos1 = m.get_aposition()
    print("Expected", i)
    print("Actual", m.get_aposition())
    diff = abs((i - pos1 + 180) % 360 - 180)
    sum_of_angle_differences += diff
    print(diff)
print("avg ", sum_of_angle_differences / len(angs))
This is how the code output looks:
Preparing the Build HAT:
Follow these instructions to set up the Build HAT. A brief:
- Seat the Build HAT onto the Raspberry Pi
- Enable Serial Port and Disable Serial Console from the interfaces tab in the Raspberry Pi configuration
- Connect the Build HAT power supply to the Build HAT
- Install the python package to start using Build HAT
$ pip3 install buildhat
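A quick way to confirm that the Build HAT and a motor are talking to the Pi is a short test like the one below (it assumes a motor is plugged into port A):
from buildhat import Motor

# Sanity check: read the motor's absolute position and nudge it by 90 degrees
m = Motor('A')
print("Absolute position:", m.get_aposition())
m.run_for_degrees(90)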
Preparing the OAK-D:
$ git clone git@github.com:dhruvsheth-ai/Gelare.git
$ sudo curl -fL https://docs.luxonis.com/install_dependencies.sh | bash
$ cd Gelare
$ pip3 install -r requirements.txt
Preparing the Raspberry Pi for Edge Impulse:
$ sudo apt update
$ sudo apt upgrade
$ curl -sL https://deb.nodesource.com/setup_12.x | sudo bash -
$ sudo apt install -y gcc g++ make build-essential nodejs sox gstreamer1.0-tools gstreamer1.0-plugins-good gstreamer1.0-plugins-base gstreamer1.0-plugins-base-apps
$ npm config set user root && sudo npm install edge-impulse-linux -g --unsafe-perm
You might have to set up the webcam through:
$ sudo raspi-config
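Once the dependencies are installed, you can connect the Pi to your Edge Impulse project (and make its camera available for data collection) by starting the CLI wizard:
$ edge-impulse-linux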
About the FOMO model architecture: Gelāre uses FOMO (Faster Objects, More Objects), designed by Edge Impulse, for all the applications built around the robotic arm. On small datasets, FOMO outperforms traditional MobileNet models as well as YOLO, and it is efficient enough to run in real time on the Raspberry Pi and even on milliwatt-class microcontrollers.
FOMO is a new approach to running object detection models on constrained devices. It enables real-time detection, tracking, and counting of objects on microcontrollers for the first time. It is up to 30 times faster than MobileNet SSD and requires less than 200 KB of RAM to run.
The idea behind FOMO is that not all object-detection applications require the high-precision output that state-of-the-art deep learning models provide. By finding the right tradeoff between accuracy, speed, and memory, you can shrink your deep learning models to very small sizes while keeping them useful.
Instead of detecting bounding boxes, FOMO predicts the object’s center. This is because many object detection applications are just interested in the location of objects in the frame and not their sizes. Detecting centroids is much more compute-efficient than bounding box prediction and requires less data.
FOMO also applies a major structural change to traditional deep learning architectures.
Single-shot object detectors are composed of a set of convolutional layers that extract features and several fully-connected layers that predict the bounding box. The convolution layers extract visual features in a hierarchical way. The first layer detects simple things such as lines and edges in different directions. Each convolutional layer is usually coupled with a pooling layer, which reduces the size of the layer’s output and keeps the most prominent features in each area.
The pooling layer’s output is then fed to the next convolutional layer, which extracts higher-level features, such as corners, arcs, and circles. As more convolutional and pooling layers are added, the feature maps zoom out and can detect complicated things such as faces and objects.
Finally, the fully connected layers flatten the output of the final convolution layer and try to predict the class and bounding box of objects.
FOMO removes the fully connected layers and the last few convolution layers. This turns the output of the neural network into a sized-down version of the image, with each output value representing a small patch of the input image. The network is then trained on a special loss function so that each output unit predicts the class probabilities for the corresponding patch in the input image. The output effectively becomes a heatmap for object types.
There are several key benefits to this approach. First, FOMO is compatible with existing architectures. For example, FOMO can be applied to MobileNetV2, a popular deep learning model for image classification on edge devices.
The FOMO model offers an alternative solution to object detection; it is a simplified version of object detection that can be used in various scenarios where the location of objects in an image is necessary, but a large or complex model cannot be utilized due to limitations in device resources.
A FOMO model is trained on centroids and is fully convolutional, so it can be attached to any convolutional image backbone and works well with transfer learning.
However, there are a few limitations:
- It is recommended for the objects to have a similar size
- Objects are recommended to be spaced apart
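To make the truncated-backbone idea above concrete, here is a minimal Keras sketch of a FOMO-style head on MobileNetV2. It is an illustrative approximation, not Edge Impulse's exact implementation; the cut layer, class count, and width multiplier are assumptions:
import tensorflow as tf

NUM_CLASSES = 3           # e.g. background plus two object classes (assumed)
INPUT_SHAPE = (96, 96, 3)

# Standard MobileNetV2 backbone, truncated early so the output keeps spatial resolution
backbone = tf.keras.applications.MobileNetV2(
    input_shape=INPUT_SHAPE, include_top=False, weights=None, alpha=0.35)
cut = backbone.get_layer("block_6_expand_relu").output   # 12x12 feature map for a 96x96 input

# Replace the fully connected head with a 1x1 convolution that predicts
# per-patch class probabilities, i.e. a heatmap over the image grid
heatmap = tf.keras.layers.Conv2D(NUM_CLASSES, kernel_size=1, activation="softmax")(cut)
model = tf.keras.Model(backbone.input, heatmap)
model.summary()   # final output shape: (None, 12, 12, NUM_CLASSES)
Each of the 12x12 output cells corresponds to an 8x8-pixel patch of the input, which is exactly the heatmap behaviour described above.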
Method 1:
This method uses the OAK-D as an assistant: object detection runs on the Raspberry Pi itself rather than on the OAK-D. This does not hurt performance, since FOMO runs in real time at high FPS on the Raspberry Pi with negligible latency. Both the RGB stream and the depth maps can be accessed with this method.
Method 2:
This method is still in the alpha stage and isn't fully tested. There seems to be some loss in the conversion, which generates many false positives. We are trying our best to make the conversion process easier. This method runs object detection on the OAK-D itself, without any processing on the Raspberry Pi.
- On the Edge Impulse dashboard, enable Custom deploys and Show Linux deploy options under the Administrative Zone.
- Head over to the Deployment section and deploy the custom block. Once installed, a trained.tflite file should appear within the zip file.
Run the following commands in Docker. You do not need to install any packages other than Docker. The image consumes about 26.7 GB of host storage.
$ docker pull ghcr.io/pinto0309/tflite2tensorflow:latest
$ docker run -it --rm \
-v `pwd`:/home/user/workdir \
ghcr.io/pinto0309/tflite2tensorflow:latest
Next:
tflite2tensorflow \
--model_path trained.tflite \
--flatc_path ../flatc \
--schema_path ../schema.fbs \
--output_pb \
--optimizing_for_openvino_and_myriad
And then:
tflite2tensorflow \
--model_path /path/to/your/trained.tflite \
--flatc_path ../flatc \
--schema_path ../schema.fbs \
--output_no_quant_float32_tflite \
--output_onnx \
--onnx_opset 11
That's it! This generates a .blob file in one of the zip files in your folder. Replace the default .blob model in the Python script with that .blob file to use a custom Edge Impulse FOMO-trained model for object detection.
For more information, click here
Applications of Assistive Robotics
Face Tracking:
Credit: Edge Impulse team. Model adapted from source.
This is a feedback-loop mechanism that tracks the X, Y, Z coordinates of the user's face and controls the movement of the robotic arm accordingly. This example employs a face detection model trained with Edge Impulse FOMO, which accurately tracks input frames with low (almost negligible) false positives. The code handles changes in X, Y, or Z case by case, in the order they arrive, and moves the robotic arm through the same angle that the user's face has moved.
To run:
$ cd software/main_ensemble/
$ python3 oakd_as_an_assistant_face_recognition.py ./models/face-detection-fomo.eim
How the program works:
The x, y, z coordinates of the user's face are calculated and then a function is invoked that moves the body of the arm according to the face's coordinates. Example code:
def bodyMovement(x, y, z):
    # 150 is treated as the frame centre and 85 as the dead-zone margin (pixels);
    # z is the face's distance from the camera
    if x < (150 - 85):
        motor_body_movement.start(30)
    elif x > (150 + 85):
        motor_body_movement.start(-30)
    elif y < (150 - 85):
        motor_body_control.start(-50)
    elif y > (150 + 85):
        motor_body_control.start(30)
    elif z < 500:
        motor_gripper_control.start(50)
    elif z > 800:
        motor_gripper_control.start(-50)
This code moves the body of the robotic arm in the x, y and z plane based on the spatial movement of the face.
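For context, a hypothetical driver loop (not the exact contents of oakd_as_an_assistant_face_recognition.py) could feed FOMO face centroids into the bodyMovement() function and motors defined above, stopping the motors once the face sits inside the dead zone; get_face_centroid is an assumed helper that returns (x, y, z) or None:
def track_face(get_face_centroid):
    motors = (motor_body_movement, motor_body_control, motor_gripper_control)
    while True:
        centroid = get_face_centroid()       # assumed helper: (x, y, z), or None if no face detected
        if centroid is None:
            for m in motors:
                m.stop()                     # no face: hold still
            continue
        x, y, z = centroid
        centred = abs(x - 150) <= 85 and abs(y - 150) <= 85 and 500 <= z <= 800
        if centred:
            for m in motors:
                m.stop()                     # face inside the dead zone: stop moving
        else:
            bodyMovement(x, y, z)            # otherwise nudge the arm toward the face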
Demo using the OAK-D:
Gesture Controlled Assistive Robotics:
Firstly, you will have to collect data. It is recommended to collect the data with the same camera that will be used for inference. After collecting around 100+ images, you will have to label them, which is possible through Edge Impulse's labelling tool on the data acquisition page. We collected three classes of images: palm, thumb, and index.
You can visit our Edge Impulse project here.
Then create the impulse as an object detection model and set the colour depth of the images to grayscale.
Finally, select FOMO as the neural network architecture and start training with the default settings.
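The run commands below use the gesture_recognition_*.py scripts from the repository. As a rough idea of what such a script does, a detected gesture class can be mapped to an arm action along these lines (the label-to-action mapping here is an illustrative assumption, not the repository's exact code):
from buildhat import Motor

motor_gripper = Motor('A')
motor_gripper_control = Motor('B')

def on_gesture(label):
    # Map FOMO gesture classes to simple arm actions (assumed mapping)
    if label == "palm":
        motor_gripper.run_to_position(140, speed=30)         # open the gripper
    elif label == "index":
        motor_gripper.run_to_position(13, speed=30)          # close the gripper
    elif label == "thumb":
        motor_gripper_control.run_to_position(70, speed=30)  # raise the forearm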
To run it directly on the Raspberry Pi using the Pi webcam:
$ cd software/main_ensemble/
$ python3 gesture_recognition_raspberry.py ./models/gesture-rec-bw.eim
To run it using the OAK-D as an assistant:
$ cd software/main_ensemble/
$ python3 gesture_recognition_oakd.py ./models/gesture-rec-bw.eim
Demo:
Robot-assisted Feeding for patients with Hand Tremor:
We collected data to train the FOMO model on a 96x96 grayscale dataset. The initial dataset consisted of strawberries and apples, augmented to contain 280+ images. However, we later trained two additional models, one on lemons and the other on a bowl of chutney, each for a different application.
The EdgeImpulse project can be accessed here.
Initial Dataset:
To investigate the algorithm behind the decision-making process, check here. Better algorithms for robot-assisted feeding exist, using state-of-the-art software or hardware. The purpose of this system is not to introduce a novel algorithm, but to make it possible to deploy low-cost, low-resource, quick solutions on affordable hardware for efficient prototyping before actual implementation. It can serve as a prototyping approach for designing and experimenting with computer vision models on portable, resource-constrained devices.
The algorithm:
Calculation of movement of Robotic Arm to spatial location of fruit:
Given below is the algorithm for calculating the angle the robotic arm must turn to reach the target position in the `X, Y` plane.
The next step is to calculate the angle the robotic arm must take in the Z direction so that it exerts only the required amount of pressure on the fruit.
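As a worked example of the X, Y step (it mirrors moveToLocation2D in the code below and assumes the same 300x300-pixel frame with the arm base centred at x = 150):
import math

# Hypothetical detected fruit centroid, in pixels
x, y = 250, 100

height = 300 - y                   # vertical offset from the bottom of the frame
offset = x - 150                   # horizontal offset from the frame centre
theta = 90 - math.degrees(math.atan(height / offset))   # base angle to turn, in degrees
rotations = theta / 45             # motor rotations, using the script's 45-degrees-per-rotation factor
print(round(theta, 1), "degrees ->", round(rotations, 2), "rotations")   # ~26.6 degrees -> ~0.59 rotations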
To run the demo:
$ cd software/main_ensemble/
$ python3 robot-assisted-feeding.py ./models/fruit_feeding2.eim
Description of the code:
def moveToLocation2D(x, y):
    # ensure that the OAK-D is pointing in the same direction as the front arm of the robotic arm
    height = 300 - y
    distanceFromCentre = x - 150  # adjusted coordinates according to the OpenCV coordinate system, which starts at (0, 0) in the top left
    tan_inv = math.atan(height / distanceFromCentre)
    tan_inv_deg = 90 - math.degrees(tan_inv)
    convert_inv_deg = tan_inv_deg / 45
    # this command aligns the robotic arm in the vectorial direction of the fruit
    if x - 150 > 150:
        motor_body_movement.run_for_rotations(convert_inv_deg, speed=60)
    else:
        motor_body_movement.run_for_rotations(convert_inv_deg, speed=-60)

def moveToLocation3D(x, y, z):
    # used to move to the exact position in the z plane
    heightFromTable = 320  # in millimetres
    fruitHeight = z - heightFromTable  # this enables calculation of the exact force to be applied
    distanceToCentroid = math.sqrt((x - 0)**2 + (y - 150)**2)
    tan_inv = math.atan(z / distanceToCentroid)
    tan_inv_deg = 90 - math.degrees(tan_inv)
    motor_gripper.run_to_position((10 - tan_inv_deg), speed=30 - fruitHeight, direction='anticlockwise')  # speed is directly related to the force applied: the greater the height, the less force needed

def otherMovements():
    motor_gripper.run_to_position(-80, speed=10, direction='anticlockwise')
    motor_body_control.run_to_position(-10, speed=90, direction='clockwise')
    motor_body_control.run_to_position(70, speed=10, direction='anticlockwise')
    motor_gripper_control.run_to_position(75, speed=20, direction='clockwise')
    motor_gripper.run_to_position(50, speed=15, direction='clockwise')
This code applies the above algorithm: it defines functions that move the robotic arm to the target location in the x, y plane and in the z plane, and it applies only the amount of force required, based on the size of the fruit to be picked up.
Extra Attachments:
You will have to add a fork to the arm of the robot so that it can poke and pick up the fruits while running the robot assisted feeder. At the end it should look like this:
The Working:
Unsuccessful attempts:
Object Detection in cluttered environments using depth maps:
Depth maps are a more useful input for object detection models in cluttered environments because they provide a 3D representation of the scene, allowing for more accurate object detection and localization. Depth maps provide key cues about the scene layout and the relative positions of objects, which are essential for accurately detecting and localizing objects in a cluttered environment.
More specifically, depth maps can provide information about the 3D structure of the scene, including the relative sizes of objects, as well as their distances from the camera. This information can be used to distinguish between objects in the scene, and provide more precise object localization compared to using only 2D image data. Additionally, depth maps can also provide information about occlusions, which can further aid object detection in crowded scenes.
In this application, we try to bring a notion of dexterity to LEGO robotic arms by spatially navigating to a cup, picking it up, and pouring its contents into a glass bowl. The navigation process is driven solely by depth maps, and it locates the exact spatial coordinates of the cup with high accuracy.
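For reference, a minimal DepthAI pipeline that streams depth frames from the OAK-D looks roughly like this (a sketch using the standard DepthAI v2 Python API, not the exact script in the repository):
import depthai as dai

# Build a stereo-depth pipeline: two mono cameras feed the StereoDepth node,
# and the depth output is streamed to the host over XLink
pipeline = dai.Pipeline()
mono_left = pipeline.create(dai.node.MonoCamera)
mono_right = pipeline.create(dai.node.MonoCamera)
mono_left.setBoardSocket(dai.CameraBoardSocket.LEFT)
mono_right.setBoardSocket(dai.CameraBoardSocket.RIGHT)

stereo = pipeline.create(dai.node.StereoDepth)
mono_left.out.link(stereo.left)
mono_right.out.link(stereo.right)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("depth")
stereo.depth.link(xout.input)

with dai.Device(pipeline) as device:
    queue = device.getOutputQueue("depth", maxSize=4, blocking=False)
    depth_frame = queue.get().getFrame()   # uint16 numpy array, depth in millimetres
    print(depth_frame.shape, depth_frame.dtype)
A depth frame like this (or a colourised rendering of it) is what the FOMO model described below is trained on, instead of a plain RGB image.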
Testing the FOMO algorithm on depth maps trained using depth images:
Training the model:
This application shows how depth maps can be used to locate a cup in a cluttered environment.
You can take a look at the Edge Impulse project here.
Start with collecting data.
- Use the OAK-D to collect depth map images of the environment with the cup present
- Take around 400 images
- Upload them to your Edge Impulse Project
- Use the built-in labelling tool to create bounding boxes
Then create the impulse as an object detection model and set the colour depth of the images to RGB.
Finally, select FOMO as the neural network architecture and train it for 30 epochs. We went with 30 epochs to stay within the 20-minute job limit of the free plan. You might have to adjust the number of epochs based on the number of data samples you have, or you can take a look here.
The accuracy of the FOMO model trained only on depth maps was significantly higher than that of a model using RGB images to detect the cup against cluttered, occluded, or otherwise indistinguishable backgrounds.
To run the demo:
$ cd software/main_ensemble/
$ python3 depth-only-dexterity-manipulation.py ./models/dexterity-depth.eim
Demo:
This demo includes calculating the spatial location of the cup, picking it up and pouring it into a bowl.
Conclusion:
In conclusion, this machine learning project integrated several technologies into a complete solution for robot-assisted feeding. The combination of face tracking and gesture recognition gave the user a comfortable way to communicate with the robot, and the use of depth maps in object detection further improved the precision and reliability of the system. The results point to a bright future for advances and applications in rehabilitation and assisted living, and this work shows the value of combining cutting-edge technologies for practical, useful applications.