In our previous article, we explored how to integrate ChatGPT with the mycobot 280 robotic arm, achieving a system that allows the robotic arm to be controlled through natural language. We detailed the motivation behind the project, key technologies used such as ChatGPT and Google's Speech-to-text service, and how we controlled the robotic arm using the pymycobot module. By combining natural language processing with robotic arm control, our project aims to lower the barrier to robotics programming, enabling non-professionals to easily engage in robotics programming and experimentation.
Next, in this article, we will discuss the challenges encountered during the development of this system, how we overcame these challenges, and the potential future expansions of the project. Our goal is to delve into the specific issues of technical implementation and explore new directions for the future development of the system.
In the process of developing the mycobot 280 robotic arm control system integrated with ChatGPT, I faced several major technical challenges.
1. Accuracy and Response Time of Speech Recognition
The first challenge was the accuracy and response time of speech recognition. Even though I used Google's Speech-to-Text service, it sometimes struggled to recognize technical terms accurately or to capture voice commands in noisy environments. This may have stemmed from my limited understanding of how the service works and how to configure it correctly. In addition, the delay from voice input to text output was long: the recognizer had trouble determining when speech had concluded, which usually resulted in a lengthy response time.
As shown in the figure, after I finished speaking, the speech recognition took about three seconds to respond.
2. Practicality and Regional Restrictions of the OpenAI API
The ChatGPT API is the core of the entire project; without it, the AI-based robotic arm control system could not be implemented. Initially, I tested my code against the web version of ChatGPT, not anticipating that using the API would pose a significant problem.
Due to regional restrictions, it was not possible to access the OpenAI API directly, which caused network delays, and proxy software or similar workarounds could not be used for access. On top of that, a stable network connection was essential for fast processing.
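While I could not remove the restriction itself, unstable connections can at least be made less painful with automatic retries. Below is a minimal retry sketch with exponential backoff; the flaky function is a simulation I made up for demonstration, not an actual OpenAI SDK call:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * (2 ** i))  # back off: 1x, 2x, 4x, ...

# Example with a deliberately flaky function that fails twice, then succeeds:
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated network hiccup")
    return "ok"

result = with_retries(flaky, attempts=5, base_delay=0.01)
print(result)  # -> ok
```

Wrapping the real API call in a helper like this keeps the main control loop free of network error handling.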
3. Converting Generated Strings into Executable Code
Assuming the code-generation issues above are resolved, we receive command-like strings that need to be converted into executable code. Initially, I only considered single-line commands, such as:
"robot.move_to_zero()"
To convert strings into executable code, we can use Python's built-in `getattr()` function, which retrieves the value of a named attribute from an object.
getattr(object, name[, default])
object: the object whose attribute is to be retrieved.
name: the name of the attribute to retrieve.
default: optional; the value returned if the named attribute does not exist.
The `getattr()` function attempts to retrieve the value of a specified attribute from the specified object. If the object has the attribute, it returns the value of the attribute; if the object does not have the specified attribute but a default value is provided, it returns the default value; if the object does not have the specified attribute and no default value is provided, it raises an `AttributeError` exception.
class MyClass:
    def print_1(self):
        print("hello world")

obj = MyClass()
getattr(obj, "print_1")()
# Output:
# hello world
This method neatly solves the problem of turning strings into executable calls. Next, let's walk through the conversion process:
The strings we receive are in the form of code, for example,
"robot.move_to_zero()"
To split this into objects and methods, Python's split method is used.
# Split on "." into two parts: the object name and the method call
command_str = "robot.move_to_zero()"
parts = command_str.split(".")
# parts[0] == "robot"
# parts[1] == "move_to_zero()"

# Remove the brackets and keep the method name
method_name = parts[1].split("()")[0]
method = getattr(robot, method_name)
method()
# Conversion and execution helper
def execute_command(instance, command_str):
    try:
        # Separate the object name and the method
        parts = command_str.split(".")
        if len(parts) != 2 or parts[0] != 'robot':
            print("Invalid command format.")
            return
        method_name = parts[1].split("()")[0]  # remove brackets
        # Use getattr to safely obtain a method reference
        if hasattr(instance, method_name):
            method = getattr(instance, method_name)
            method()
        else:
            print(f"The method {method_name} does not exist!")
    except Exception as e:
        print(f"An error occurred: {e}")
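As a quick sanity check, here is a self-contained usage sketch of this single-line dispatcher; the `Robot` class below is a stand-in I made up for demonstration, not the real pymycobot interface:

```python
def execute_command(instance, command_str):
    """Dispatch a single 'robot.method()' string to a method on instance."""
    try:
        parts = command_str.split(".")
        if len(parts) != 2 or parts[0] != 'robot':
            print("Invalid command format.")
            return
        method_name = parts[1].split("()")[0]  # remove brackets
        if hasattr(instance, method_name):
            getattr(instance, method_name)()
        else:
            print(f"The method {method_name} does not exist!")
    except Exception as e:
        print(f"An error occurred: {e}")

class Robot:
    """Hypothetical stand-in for the real robotic arm wrapper."""
    def __init__(self):
        self.calls = []
    def move_to_zero(self):
        self.calls.append("move_to_zero")

robot = Robot()
execute_command(robot, "robot.move_to_zero()")  # dispatches to move_to_zero
execute_command(robot, "robot.dance()")         # prints that the method does not exist
execute_command(robot, "move_to_zero()")        # prints invalid command format
print(robot.calls)  # -> ['move_to_zero']
```

Only the well-formed `robot.move_to_zero()` string reaches the arm; malformed or unknown commands are rejected with a message instead of raising.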
This completes the processing of single-line strings. However, when I tested commands that generate multiple lines, this code broke down: the response arrives as one long string, and the method above no longer works.
These three are the main issues I encountered, and I will address each of them in turn.
Solutions and Strategies
1. Optimizing Speech Recognition
To address the recognition delay described above, I optimized the program by setting time limits.
# Set timeout to 3 seconds, phrase_time_limit to 10 seconds
audio = recognizer.listen(source, timeout=3, phrase_time_limit=10)
By default, the recognizer keeps listening until it stops hearing sound. By setting a 3-second timeout and a 10-second phrase time limit, I ensured a faster response after I finished speaking.
import speech_recognition as sr

def speech_to_text():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("start speaking...")
        try:
            audio = recognizer.listen(source, timeout=3, phrase_time_limit=10)
        except sr.WaitTimeoutError:
            print("No speech was detected within the timeout period.")
            return None
    try:
        text = recognizer.recognize_google(audio, language='en-US')
        print("You said: " + text)
        return text
    except sr.UnknownValueError:
        print("Google Speech Recognition could not understand audio")
        return None
    except sr.RequestError as e:
        print(f"Could not request results from Google Speech Recognition service; {e}")
        return None
This solution currently meets most requirements. From practical use, the overall functionality is quite comprehensive, effectively recognizing spoken content. Notably, when I speak numbers, it automatically converts them into Arabic numerals, which saves the trouble of processing numbers during interaction.
2. Optimizing Natural Language Conversion
To handle multi-line commands, we cannot simply split them the way we did with single-line commands; we need a different approach. Assuming the commands generated by ChatGPT are multi-line strings without comments (ChatGPT tends to include comments), we can treat the multiple lines as a single block and process them line by line.
"robot.move_to_zero()
robot.grab_position()
robot.plus_z_coords(20)"
# Split into multiple lines
commands = command_str.strip().split('\n')
# First, strip any surrounding whitespace
for cmd in commands:
    cmd = cmd.strip()
    if not cmd:
        continue
    # We assume the object is 'robot', so we only need the method names
    if cmd.startswith("robot."):
        cmd = cmd[6:]
    # Split the method name from its parameters
    if '(' in cmd and cmd.endswith(")"):
        method_name, args_str = cmd.split('(', 1)
        method_name = method_name.strip()
        args_str = args_str.rstrip(")")
        args = [arg.strip() for arg in args_str.split(',')] if args_str else []
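Putting the pieces together, here is a self-contained sketch of the multi-line dispatcher. The `Robot` class and its methods are hypothetical stand-ins for the real pymycobot wrapper, and I use `ast.literal_eval` to convert argument strings such as `"20"` into Python values:

```python
import ast

class Robot:
    """Hypothetical stand-in for the real robotic arm wrapper."""
    def __init__(self):
        self.log = []
    def move_to_zero(self):
        self.log.append(("move_to_zero", ()))
    def grab_position(self):
        self.log.append(("grab_position", ()))
    def plus_z_coords(self, dz):
        self.log.append(("plus_z_coords", (dz,)))

def execute_multiline(instance, command_str):
    """Execute each line of a multi-line 'robot.method(args)' string."""
    for cmd in command_str.strip().split('\n'):
        cmd = cmd.strip()
        if not cmd:
            continue
        if cmd.startswith("robot."):
            cmd = cmd[6:]
        if '(' in cmd and cmd.endswith(")"):
            method_name, args_str = cmd.split('(', 1)
            method_name = method_name.strip()
            args_str = args_str.rstrip(")")
            # Convert "20" -> 20, "1.5" -> 1.5, etc.
            args = [ast.literal_eval(a.strip()) for a in args_str.split(',')] if args_str else []
            if hasattr(instance, method_name):
                getattr(instance, method_name)(*args)
            else:
                print(f"The method {method_name} does not exist!")

robot = Robot()
execute_multiline(robot, "robot.move_to_zero()\nrobot.grab_position()\nrobot.plus_z_coords(20)")
print(robot.log)
```

Each line is parsed into a method name and an argument list, then dispatched via `getattr`, so the three commands run in order with `20` passed as an integer.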
As for the regional restrictions, I have not been able to resolve them effectively: it is still not possible to use the API directly to obtain responses. If anyone has a good solution, please feel free to message me and discuss it.
Project Expansion and Future Prospects
Visual Functionality
The most crucial module, the vision module, is still missing from this record; a robotic arm without eyes is effectively working blind. Developing this part will require significant effort, and if development reaches a presentable stage, I will share updates in a timely manner.
I've also seen a similar project developed by Shirokuma in Japan, which used GPT-4's vision capability to identify and grab specified objects. The project is quite interesting and has inspired many ideas for the development of my own.
https://twitter.com/neka_nat/status/1733517151947108717
Most of us have seen Iron Man, and with the continuous development of AI, I believe that in the near future there will be robotic arms like those in the movies, able to help you complete tasks through simple conversation.
The past few years have been a period of rapid advancement in artificial intelligence. AIGC (Artificial Intelligence Generated Content) is the hottest recent topic, capable of generating text, images, video, and audio from nothing more than a prompt.
Summary
I am very excited about the future: how far can AI and robotics integrate, and can they already help humanity with certain tasks? If you have any good ideas or suggestions for my project, please feel free to share them with me!