This document has been authorized for editing and reprinting by TommyZihao.
https://github.com/TommyZihao/vlm_arm/tree/main
To date, this case has received 300,000 views across the web.
https://www.bilibili.com/video/BV18w4m1U7Fi/?spm_id_from=333.999.0.0
With the rapid development of artificial intelligence and robotics, robotic arms are increasingly used in industrial, medical, and service settings. By integrating large models and multimodal AI, robotic arms can perform more complex and intelligent tasks, improving the efficiency and effectiveness of human-robot collaboration. Although we rarely encounter robotic arms in daily life, myCobot is a small, affordable desktop robotic arm that anyone can own.
Case Introduction
This document introduces an open-source project called "vlm_arm" by TommyZihao. The project combines the myCobot robotic arm with large models and multimodal AI to create an embodied intelligent agent, demonstrating how advanced AI technology can enhance the automation and intelligence of robotic arms. The purpose of this document is to showcase the practical application of embodied intelligent agents through a detailed introduction to the methods and results of this case.
Product Introduction
myCobot 280 Pi
myCobot 280 Pi is a six-degree-of-freedom desktop robotic arm with a primary control core based on Raspberry Pi 4B and an auxiliary control core based on ESP32. It is equipped with the Ubuntu Mate 20.04 operating system and a rich development environment, allowing development without an external PC, requiring only a monitor, keyboard, and mouse.
This robotic arm is lightweight and compact, with multiple software and hardware interaction functions, compatible with various device interfaces. It supports multi-platform secondary development, making it suitable for AI-related education, personal creative development, and commercial applications.
Camera Flange 2.0
The camera used in the case is connected to the Raspberry Pi via a USB data cable, allowing for image processing in machine vision.
The suction pump lifts objects by using a solenoid valve to create a pressure difference. It is connected to the robotic arm through an IO interface and controlled via the pymycobot API.
The end of the robotic arm is connected using LEGO connectors, making it easy to connect without additional structural parts.
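As a rough illustration of the IO-based pump control mentioned above, here is a hedged sketch that drives a Raspberry Pi GPIO pin directly with RPi.GPIO rather than going through pymycobot; the pin number and the active-low polarity are assumptions that depend on how the pump is actually wired.
import time
import RPi.GPIO as GPIO

PUMP_PIN = 20                  # Assumed GPIO pin wired to the pump's solenoid valve

GPIO.setmode(GPIO.BCM)
GPIO.setup(PUMP_PIN, GPIO.OUT)

def pump_on():
    # Pull the pin low to start suction (assumed active-low wiring)
    GPIO.output(PUMP_PIN, 0)
    time.sleep(0.05)

def pump_off():
    # Release the pin to stop suction and drop the object
    GPIO.output(PUMP_PIN, 1)
    time.sleep(0.05)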
The entire project is developed in a Python environment. Below are the libraries used:
● pymycobot: A Python library developed by Elephant Robotics for controlling the myCobot, capable of controlling the arm's movements through coordinates and angles, as well as official end effectors like grippers and suction pumps.
● Yi-Large: A large language model developed by the Chinese AI company 01.AI, with over 100 billion parameters. Yi-Large uses an improved Transformer architecture for better performance in language and visual tasks.
● Claude 3 Opus: A model with strong multilingual processing capabilities and improved visual analysis functions, capable of transcribing and analyzing images. It is designed with an emphasis on responsibility and safety, reducing bias and privacy issues and ensuring more trustworthy and neutral outputs.
● AppBuilder-SDK: A comprehensive SDK containing AI capabilities such as speech recognition, natural language processing, and image recognition. It includes components like short speech recognition, general text recognition, document parsing, table extraction, landmark recognition, and Q&A extraction, allowing developers to build a range of projects from basic AI functions to complex applications, improving development efficiency.
The large language models mentioned in this case can each be tested individually to compare their outputs.
Project
Before introducing the project in detail, it helps to outline its structure with a flowchart for better understanding.
First, use the local computer to record audio through the microphone.
import os

def record(MIC_INDEX=0, DURATION=5):
    # Record DURATION seconds of 16 kHz mono audio from microphone MIC_INDEX using arecord
    os.system('sudo arecord -D "plughw:{}" -f dat -c 1 -r 16000 -d {} temp/speech_record.wav'.format(MIC_INDEX, DURATION))
Of course, the default recording settings may not perform well in every environment, so the relevant parameters need to be adjusted to ensure recording quality.
import sys
import pyaudio

CHUNK = 1024              # Number of audio frames read per buffer
RATE = 16000              # Sampling rate in Hz
QUIET_DB = 2000           # Volume threshold below which audio is treated as silence
delay_time = 1            # Seconds of silence allowed before recording stops
FORMAT = pyaudio.paInt16  # 16-bit samples
CHANNELS = 1 if sys.platform == 'darwin' else 2  # Mono on macOS, stereo elsewhere
Adjust the parameters accordingly, then start recording. Once recording stops, the captured audio needs to be saved to a WAV file.
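The excerpt does not show the recording loop itself. Below is a minimal sketch of one way to implement it, assuming a simple volume threshold is used to find where speech starts and ends; it produces the frames, p, START_TIME, and END_TIME values that the saving code that follows relies on.
import numpy as np
import pyaudio

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

frames = []                      # All recorded chunks of raw audio
START_TIME, END_TIME = None, 0   # Chunk indices where speech starts / was last heard

for i in range(int(RATE / CHUNK * 10)):              # Record for at most ~10 seconds
    data = stream.read(CHUNK, exception_on_overflow=False)
    frames.append(data)
    volume = np.abs(np.frombuffer(data, dtype=np.int16)).mean()
    if volume > QUIET_DB and START_TIME is None:
        START_TIME = i                               # First loud chunk: speech begins
    if START_TIME is not None and volume > QUIET_DB:
        END_TIME = i                                 # Most recent loud chunk
    if START_TIME is not None and i - END_TIME > delay_time * RATE / CHUNK:
        break                                        # Quiet for delay_time seconds: stop
if START_TIME is None:
    START_TIME = 2                                   # Fallback if no speech was detected

stream.stop_stream()
stream.close()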
import wave

output_path = 'temp/speech_record.wav'
wf = wave.open(output_path, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
# Keep only the chunks from just before the detected start of speech to the detected end
wf.writeframes(b''.join(frames[START_TIME-2:END_TIME]))
wf.close()
print('save', output_path)
With the audio file ready, and since the computer cannot understand speech on its own, we use the AppBuilder-SDK to recognize the speech in the audio file. This way, the LLM can understand what we are saying and perform the corresponding actions.
import appbuilder

# APPBUILDER_TOKEN is your AppBuilder access token; set it before running
os.environ["APPBUILDER_TOKEN"] = APPBUILDER_TOKEN
asr = appbuilder.ASR()
def speech_recognition(audio_path='temp/speech_record.wav'):
    # Read the recorded WAV file and send its raw frames to the AppBuilder ASR component
    with wave.open(audio_path, 'rb') as wav_file:
        num_channels = wav_file.getnchannels()
        sample_width = wav_file.getsampwidth()
        framerate = wav_file.getframerate()
        num_frames = wav_file.getnframes()
        frames = wav_file.readframes(num_frames)
    content_data = {"audio_format": "wav", "raw_audio": frames, "rate": 16000}
    message = appbuilder.Message(content_data)
    speech_result = asr.run(message).content['result'][0]
    return speech_result
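A quick end-to-end check of this part of the pipeline, assuming the microphone index and the AppBuilder token have been configured, might look like this:
record(MIC_INDEX=0, DURATION=5)                       # Record a short voice command
order = speech_recognition('temp/speech_record.wav')  # Transcribe it with the ASR component
print('Recognized instruction:', order)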
Prompt-AgentNext, prompt the large language model to respond appropriately based on specific situations.
Prompt: (Extracted excerpts, the following is a translation of the Chinese text)
You are my robotic arm assistant. The robotic arm has some built-in functions. Please output the corresponding functions to be executed and your reply to me in JSON format according to my instructions.
【The following are all built-in function descriptions】
Reset the robotic arm position, all joints return to the origin: back_zero()
Relax the robotic arm, all joints can be freely dragged manually: relax_arm()
Perform a head-shaking action: head_shake()
Perform a nodding action: head_nod()
Perform a dancing action: head_dance()
Turn on the suction pump: pump_on()
Turn off the suction pump: pump_off()
【Output JSON format】
Output the JSON directly, starting from {; do not wrap the output with leading or trailing ```json markers.
In the 'function' key, output a list of function names, each element in the list is a string representing the function name and parameters to be executed. Each function can be executed individually or in sequence with other functions. The order of the elements in the list indicates the order of execution.
In the 'response' key, according to my instructions and the actions you arranged, output your reply to me in the first person. Do not exceed 20 characters. You can be humorous and divergent, using lyrics, lines, internet memes, and famous scenes. For example, Li Yunlong's lines, Zhen Huan's lines, two and a half years of practice.
【The following are some specific examples】
My instruction: Return to origin. You output: {'function':['back_zero()'], 'response':'Go home, back to the original beauty'}
My instruction: First return to the origin, then dance. You output: {'function':['back_zero()', 'head_dance()'], 'response':'My dance moves, practiced for two and a half years'}
My instruction: First return to the origin, then move to coordinates 180, -90. You output: {'function':['back_zero()', 'move_to_coords(X=180, Y=-90)'], 'response':'Precision? I'm hitting the elite'}
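Once the model returns such a string, the agent still has to turn it into real function calls. Below is a minimal sketch of that step rather than the project's exact code; it assumes the built-in arm functions listed above are defined in the current scope, and it uses ast.literal_eval because the examples above use Python-style single quotes (llm_response is a placeholder for whatever string the model returned).
import ast

def parse_and_execute(llm_response):
    # The prompt asks the model to reply with a dict-like string starting at '{'
    plan = ast.literal_eval(llm_response)
    print('Robot says:', plan['response'])   # The short reply to speak or display
    for call in plan['function']:            # e.g. "back_zero()" or "move_to_coords(X=180, Y=-90)"
        print('Executing:', call)
        eval(call)                           # Dispatch to the built-in arm functions (trusted local demo only)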
Intelligent Visual Grasping
In this process, the myCobot moves to an overhead position to capture images, which are then processed by the visual model to obtain parameters for the robotic arm to perform grasping actions.
import cv2

def check_camera():
    # Preview the USB camera stream to confirm the camera works; press 'q' to quit
    cap = cv2.VideoCapture(0)
    while True:
        ret, frame = cap.read()
        # gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        cv2.imshow('frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
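check_camera() only previews the video stream. For the grasping step itself, a single top-down photo has to be saved for the visual model; a minimal sketch of such a capture is shown below, where the helper name top_view_shot and the output path temp/vl_now.jpg are assumptions for illustration.
import cv2

def top_view_shot(img_path='temp/vl_now.jpg'):
    # Grab one frame from the camera mounted above the workspace and save it to disk
    cap = cv2.VideoCapture(0)
    ret, frame = cap.read()
    cap.release()
    if ret:
        cv2.imwrite(img_path, frame)
    return img_path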
The image is then handed to the visual large model for processing. The parameters it returns need further post-processing to visualize the result, and the returned normalized coordinates must be converted into actual pixel coordinates in the image.
import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont

# Font used to draw Chinese object names; the path and size are assumptions, point this at any Chinese-capable font file
font = ImageFont.truetype('asset/SimHei.ttf', 26)

def post_processing_viz(result, img_path, check=False):
    '''
    Post-processing and visualization of the results output by the visual large model
    check: whether manual confirmation of successful visualization is needed, press a key to continue or exit
    '''
    # Post-processing
    img_bgr = cv2.imread(img_path)
    img_h = img_bgr.shape[0]
    img_w = img_bgr.shape[1]
    # Scaling factor
    FACTOR = 999
    # Starting object name
    START_NAME = result['start']
    # Ending object name
    END_NAME = result['end']
    # Starting point, top-left pixel coordinates
    START_X_MIN = int(result['start_xyxy'][0][0] * img_w / FACTOR)
    START_Y_MIN = int(result['start_xyxy'][0][1] * img_h / FACTOR)
    # Starting point, bottom-right pixel coordinates
    START_X_MAX = int(result['start_xyxy'][1][0] * img_w / FACTOR)
    START_Y_MAX = int(result['start_xyxy'][1][1] * img_h / FACTOR)
    # Starting point, center pixel coordinates
    START_X_CENTER = int((START_X_MIN + START_X_MAX) / 2)
    START_Y_CENTER = int((START_Y_MIN + START_Y_MAX) / 2)
    # Ending point, top-left pixel coordinates
    END_X_MIN = int(result['end_xyxy'][0][0] * img_w / FACTOR)
    END_Y_MIN = int(result['end_xyxy'][0][1] * img_h / FACTOR)
    # Ending point, bottom-right pixel coordinates
    END_X_MAX = int(result['end_xyxy'][1][0] * img_w / FACTOR)
    END_Y_MAX = int(result['end_xyxy'][1][1] * img_h / FACTOR)
    # Ending point, center pixel coordinates
    END_X_CENTER = int((END_X_MIN + END_X_MAX) / 2)
    END_Y_CENTER = int((END_Y_MIN + END_Y_MAX) / 2)
    # Visualization
    # Draw the starting object box
    img_bgr = cv2.rectangle(img_bgr, (START_X_MIN, START_Y_MIN), (START_X_MAX, START_Y_MAX), [0, 0, 255], thickness=3)
    # Draw the starting center point
    img_bgr = cv2.circle(img_bgr, [START_X_CENTER, START_Y_CENTER], 6, [0, 0, 255], thickness=-1)
    # Draw the ending object box
    img_bgr = cv2.rectangle(img_bgr, (END_X_MIN, END_Y_MIN), (END_X_MAX, END_Y_MAX), [255, 0, 0], thickness=3)
    # Draw the ending center point
    img_bgr = cv2.circle(img_bgr, [END_X_CENTER, END_Y_CENTER], 6, [255, 0, 0], thickness=-1)
    # Write Chinese object names
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)  # BGR to RGB
    img_pil = Image.fromarray(img_rgb)  # array to PIL
    draw = ImageDraw.Draw(img_pil)
    # Write the Chinese name of the starting object
    draw.text((START_X_MIN, START_Y_MIN-32), START_NAME, font=font, fill=(255, 0, 0, 1))  # Text coordinates, Chinese string, font, RGBA color
    # Write the Chinese name of the ending object
    draw.text((END_X_MIN, END_Y_MIN-32), END_NAME, font=font, fill=(0, 0, 255, 1))  # Text coordinates, Chinese string, font, RGBA color
    img_bgr = cv2.cvtColor(np.array(img_pil), cv2.COLOR_RGB2BGR)  # RGB to BGR
    return START_X_CENTER, START_Y_CENTER, END_X_CENTER, END_Y_CENTER
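For reference, the structure of the result dict this function expects can be read off the code above: object names under 'start' and 'end', plus normalized top-left and bottom-right corners under 'start_xyxy' and 'end_xyxy'. A call with illustrative values (the object names, numbers, and image path here are made up) might look like:
result = {
    'start': 'red block',
    'end': 'white box',
    'start_xyxy': [[150, 120], [250, 220]],   # Normalized [x, y] corners, top-left then bottom-right
    'end_xyxy': [[700, 500], [850, 650]],
}
START_X, START_Y, END_X, END_Y = post_processing_viz(result, 'temp/vl_now.jpg')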
Hand-eye calibration is needed to convert the pixel coordinates in the image into robotic arm coordinates, so that the arm can perform the grasping action.
def eye2hand(X_im=160, Y_im=120):
    # Organize the coordinates of the two calibration points
    cali_1_im = [130, 290]                    # Bottom left, pixel coordinates of the first calibration point, fill in manually!
    cali_1_mc = [-21.8, -197.4]               # Bottom left, robotic arm coordinates of the first calibration point, fill in manually!
    cali_2_im = [640, 0]                      # Top right, pixel coordinates of the second calibration point
    cali_2_mc = [215, -59.1]                  # Top right, robotic arm coordinates of the second calibration point, fill in manually!
    X_cali_im = [cali_1_im[0], cali_2_im[0]]  # Pixel coordinates
    X_cali_mc = [cali_1_mc[0], cali_2_mc[0]]  # Robotic arm coordinates
    Y_cali_im = [cali_2_im[1], cali_1_im[1]]  # Pixel coordinates, from small to large
    Y_cali_mc = [cali_2_mc[1], cali_1_mc[1]]  # Robotic arm coordinates, from large to small
    # Interpolate X from pixel coordinates to robotic arm coordinates
    X_mc = int(np.interp(X_im, X_cali_im, X_cali_mc))
    # Interpolate Y from pixel coordinates to robotic arm coordinates
    Y_mc = int(np.interp(Y_im, Y_cali_im, Y_cali_mc))
    return X_mc, Y_mc
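Putting this section together, here is a hedged sketch of one pick-and-place cycle, not the project's exact code. It assumes the project's built-in pump_on()/pump_off() helpers, a MyCobot instance from pymycobot (the serial port, speed, heights, and end-effector orientation are placeholder values to adjust for a real setup), and the post_processing_viz and eye2hand functions defined above.
import time
from pymycobot import MyCobot

mc = MyCobot('/dev/ttyAMA0', 1000000)   # Typical port and baud rate for the 280 Pi, adjust as needed

def vlm_grasp(result, img_path):
    # 1. Turn the visual model's normalized boxes into pixel centers
    sx, sy, ex, ey = post_processing_viz(result, img_path)
    # 2. Convert pixel coordinates into robotic arm coordinates via hand-eye calibration
    start_x, start_y = eye2hand(sx, sy)
    end_x, end_y = eye2hand(ex, ey)
    # 3. Move above the object, descend, and pick it up with the suction pump
    mc.send_coords([start_x, start_y, 150, 0, 180, 90], 40, 0)
    time.sleep(3)
    mc.send_coords([start_x, start_y, 90, 0, 180, 90], 40, 0)
    time.sleep(3)
    pump_on()
    # 4. Carry the object to the target position and release it
    mc.send_coords([end_x, end_y, 150, 0, 180, 90], 40, 0)
    time.sleep(3)
    pump_off()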
Integrating all of these technologies forms a complete agent that can carry out the full voice-to-action workflow described above.
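As a hedged sketch of how these stages might chain together in one loop, the wrapper below reuses the parse_and_execute sketch from the Prompt-Agent section; agent_play is a hypothetical name, and llm_chat and AGENT_SYS_PROMPT are placeholders for whichever large model client and system prompt are used.
def agent_play():
    # 1. Listen: record the user's voice command and transcribe it
    record(DURATION=5)
    order = speech_recognition('temp/speech_record.wav')
    # 2. Think: ask the LLM to turn the instruction into function calls plus a short reply
    llm_response = llm_chat(AGENT_SYS_PROMPT + order)   # Placeholder LLM call
    # 3. Act: execute the planned functions (grasping instructions go through the vision pipeline)
    parse_and_execute(llm_response)

while True:
    agent_play()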
Summary
The VLM_ARM project demonstrates the immense potential of combining multiple large models with robotic arms, offering new ideas and methods for human-machine collaboration and intelligent applications. This case not only showcases the innovation and practicality of the technology but also provides valuable experience and reference for the development of similar projects in the future. Through an in-depth analysis of the project, we can see the significant impact of using multiple models in parallel to enhance system intelligence, laying a solid foundation for the further advancement of robotics technology.
We are getting closer to achieving a JARVIS-like system from Iron Man, and the scenes depicted in movies will eventually become reality.