Recently, I built an AI desktop assistant using OpenAI's GPT (Generative Pre-trained Transformer) technology and the Azure Speech API. GPT, a cutting-edge natural language processing model, can understand and generate human-like text, streamlining communication with computers; coupled with the Azure Speech API's voice recognition capabilities, it lets the assistant listen to spoken questions and answer aloud.
This project was inspired by a landmark event in the tech industry that took place on November 8th: OpenAI's developer conference. During the conference, OpenAI demonstrated cross-modal interaction and introduced the GPT Store (GPTs), a collection of applications based on GPT technology, each offering expert-level services in its respective field. Cross-modal interaction and the GPT Store opened up a new approach for my AI desktop assistant project, enabling it to perform more complex tasks through natural language processing and speech interaction.
To bring this project to life, I chose the Unihiker as the hardware platform. This device has a built-in touchscreen, Wi-Fi, Bluetooth, and sensors for light, motion, and gyroscopic measurements. Its co-processor is well suited for interacting with external sensors and actuators, and the provided Python library greatly simplifies device control. In the following sections, I will walk through the development of this AI desktop assistant, which integrates Microsoft Azure and OpenAI GPT, and share an alternative method for creating an intelligent voice desktop agent using only the OpenAI API.
Part 1: Preparing the Hardware
1. Hardware
For the hardware integration of this miniaturized desktop product, I chose a 10 cm power extension cord so the charging port could be positioned more sensibly. I also combined an amplifier with a dual-channel speaker to ensure high-quality sound playback despite the small enclosure.
The amplifier's connection interface needs to be soldered directly onto the solder pads on the Unihiker, as shown in the following figure.
3D printing file link: https://www.thingiverse.com/thing:6307018
In terms of design, my model integrates the power port into the shape of the brand's IP character, where the 'circle on top' serves as the power button.
During installation, a specific assembly sequence is required to make the most of the internal space. After fixing the speakers in place, lift them from the top so the Unihiker can be inserted from the bottom. Once the Unihiker is secured in the screen's mounting position, push in the sound card to complete the installation.
Part 2: Software Development
To implement this feature, we need to combine multiple libraries and APIs for speech recognition, text-to-speech, and interaction with the GPT model.
First, we need to register with Azure and OpenAI so that 'azure.speech_key' and 'openai.api_key' can be used.
How to register Azure and get the API key, please check this tutorial: https://community.dfrobot.com/makelog-313501.html
How to find OpenAI's API key, please check this link: https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key
After registering on these two platforms, note Azure's 'speech_key' and 'service_region' and OpenAI's 'openai.api_key'; these settings will be used later.
After obtaining the API keys, we can start writing the Python program.
These functions rely on network access, so the Unihiker needs to be connected to the network first.
How to connect to the Unihiker for programming, please check: https://www.unihiker.com/wiki/connection
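Before running anything, it can help to confirm that the board really has internet access. The following optional check is my own addition (not part of the original project code); it simply tries to open a TCP connection to api.openai.com:

import socket

def network_available(host="api.openai.com", port=443, timeout=5):
    # Returns True if a TCP connection to the given host and port succeeds
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not network_available():
    print("No network connection - please connect the Unihiker to Wi-Fi first.")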
1. Import libraries and modules:
- unihiker.Audio: provides audio-related functions.
- unihiker.GUI: creates a graphical user interface (GUI).
- openai: used to interact with OpenAI models.
- time: provides time-related functions.
- os: provides operating-system-related functions.
- azure.cognitiveservices.speech.SpeechConfig: provides speech-related configuration.
from unihiker import Audio
from unihiker import GUI
import openai
import time
import os
from azure.cognitiveservices.speech import SpeechConfig
2. Set the key and region:
Create an instance with your key and location/region.
- speech_key: specifies the key for the Azure Speech service.
- service_region: specifies the region/location of the Azure Speech service.
speech_key = "xxxxx" # Fill key
service_region = "xxx" # Enter Location/Region
3. Set OpenAI API key:
- openai.api_key: Set the API key for interacting with the OpenAI GPT model.
openai.api_key = "xxxxxxxxxxx" #input OpenAI api key
4. Import Azure Speech SDK:
- The code attempts to import the azure.cognitiveservices.speech module and prints an error message if the import fails.
try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:
    print("""
    Importing the Speech SDK for Python failed.
    Refer to
    https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstart-python for
    installation instructions.
    """)
    import sys
    sys.exit(1)
5. Function: Recognize speech from default microphone
- Uses the default microphone to capture speech.
- recognize_once_async: performs recognition in a non-blocking (asynchronous) mode. This recognizes a single utterance; the end of the utterance is determined by listening for silence, or after a maximum of 15 seconds of audio has been processed.
# speech to text
def recognize_from_microphone():
    # Uses the speech_config object created below from speech_key and service_region
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    # Exception reminder
    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return speech_recognition_result.text
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")
6. tts(text): Use Azure Speech SDK to convert text to speech.
- Play speech using default speakers
# text to speech
def tts(text):
    speech_config.set_property(property_id=speechsdk.PropertyId.SpeechServiceResponse_RequestSentenceBoundary, value='true')
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer.synthesis_word_boundary.connect(speech_synthesizer_word_boundary_cb)
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()
7. speech_synthesizer_word_boundary_cb(evt):
A callback function that handles word boundaries during speech synthesis, producing the effect of words appearing on screen one after another.
# display text one by one
def speech_synthesizer_word_boundary_cb(evt: speechsdk.SessionEventArgs):
    global text_display
    if not (evt.boundary_type == speechsdk.SpeechSynthesisBoundaryType.Sentence):
        text_result = evt.text
        text_display = text_display + " " + text_result
        trans.config(text = text_display)
    if evt.text == ".":
        text_display = ""
8. askOpenAI(question):
Sends the message list to the OpenAI GPT model and returns the generated answer. (You can also choose other versions of the GPT model.)
# openai
def askOpenAI(question):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages = question
    )
    return completion['choices'][0]['message']['content']
9. Event callback functions:
- button_click1(): sets the flag variable to 1.
- button_click2(): sets the flag variable to 3.
def button_click1():
    global flag
    flag = 1

def button_click2():
    global flag
    flag = 3
10. Speech service configuration:
- speech_config: configures the Azure Speech SDK using the provided speech key, region, language, and voice settings.
- Text-to-speech in the Speech service supports more than 400 voices and more than 140 languages and variants. You can see the full list or try the voices in the voice gallery (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts).
# speech service configuration
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
speech_config.speech_synthesis_language = 'en-US'
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
(The speaking voice is determined in the following order of priority:
- If neither SpeechSynthesisVoiceName nor SpeechSynthesisLanguage is set, the default voice for en-US is used.
- If only SpeechSynthesisLanguage is set, the default voice for the specified locale is used.
- If both SpeechSynthesisVoiceName and SpeechSynthesisLanguage are set, the SpeechSynthesisLanguage setting is ignored and the voice specified by SpeechSynthesisVoiceName is used.
- If you set voice elements with Speech Synthesis Markup Language (SSML), the SpeechSynthesisVoiceName and SpeechSynthesisLanguage settings are ignored.)
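If you need finer control than a single global voice (for example changing the speaking style or switching voices mid-response), the synthesizer can also be fed SSML instead of plain text, which overrides the settings above. A minimal sketch, not used in this project, reusing the same speech_config:

# Optional sketch: synthesize from SSML instead of plain text
def tts_ssml(text):
    ssml = (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>"
        "<voice name='en-US-JennyNeural'>" + text + "</voice>"
        "</speak>"
    )
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    synthesizer.speak_ssml_async(ssml).get()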
11. Initialize GUI and audio objects:
- u_gui: creates an instance of the GUI class from the unihiker library.
- u_audio: creates an instance of the Audio class from the unihiker library.
- Create and configure various GUI elements such as images, buttons, and text.
- The screen resolution is 240x320, so the unihiker library uses a 240x320 coordinate system. The origin is the upper-left corner of the screen; the positive x-axis points right and the positive y-axis points down.
u_gui=GUI()
u_audio = Audio()
# GUI initialization
img1=u_gui.draw_image(image="background.jpg",x=0,y=0,w=240)
button=u_gui.draw_image(image="mic.jpg",x=13,y=240,h=60,onclick=button_click1)
refresh=u_gui.draw_image(image="refresh.jpg",x=157,y=240,h=60,onclick=button_click2)
init=u_gui.draw_text(text="Tap to speak",x=27,y=50,font_size=15, color="#00CCCC")
trans=u_gui.draw_text(text="",x=2,y=0,font_size=12, color="#000000")
trans.config(w=230)
result = ""
flag = 0
text_display = ""
message = [{"role": "system", "content": "You are a helpful assistant."}]
user = {"role": "user", "content": ""}
assistant = {"role": "assistant", "content": ""}
12. Main loop:
The code enters an infinite loop, constantly checking the value of the flag variable and acting on it:
- When flag is 0, the GUI buttons are enabled.
- When flag is 1, the code listens for voice input from the microphone, adds the user's message to the message list, and updates the GUI with the recognized text.
- When flag is 2, the code sends the message list to the OpenAI model, appends the generated response, and synthesizes it into speech.
- When flag is 3, the message list is cleared and the system message is re-added.
while True:
    if (flag == 0):
        button.config(image="mic.jpg",state="normal")
        refresh.config(image="refresh.jpg",state="normal")
    if (flag == 3):
        message.clear()
        message = [{"role": "system", "content": "You are a helpful assistant."}]
    if (flag == 2):
        azure_synthesis_result = askOpenAI(message)
        assistant["content"] = azure_synthesis_result
        message.append(assistant.copy())
        tts(azure_synthesis_result)
        time.sleep(1)
        flag = 0
        trans.config(text=" ")
        button.config(image="",state="normal")
        refresh.config(image="",state="normal")
        init.config(x=15)
    if (flag == 1):
        init.config(x=600)
        trans.config(text="Listening...")
        button.config(image="",state="disable")
        refresh.config(image="",state="disable")
        result = recognize_from_microphone()
        user["content"] = result
        message.append(user.copy())
        trans.config(text=result)
        time.sleep(2)
        trans.config(text="Thinking...")
        flag = 2
I. Complete code for the first method: GPT & Azure
from unihiker import Audio
from unihiker import GUI
import openai
import time
import os
from azure.cognitiveservices.speech import SpeechConfig
speech_key = "xxxxxxxxx" # Fill key
service_region = "xxxxx" # Enter Location/Region
openai.api_key = "xxxxxxxxxx" # inputOpenAI api key
try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:
    print("""
    Importing the Speech SDK for Python failed.
    Refer to
    https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstart-python for
    installation instructions.
    """)
    import sys
    sys.exit(1)
# Set up the subscription info for the Speech Service:
# Replace with your own subscription key and service region (e.g., "japaneast").
# speech to text
def recognize_from_microphone():
    # Uses the speech_config object created below from speech_key and service_region
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    # Exception reminder
    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        # print("Recognized: {}".format(speech_recognition_result.text))
        return speech_recognition_result.text
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")
# text to speech
def tts(text):
    speech_config.set_property(property_id=speechsdk.PropertyId.SpeechServiceResponse_RequestSentenceBoundary, value='true')
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer.synthesis_word_boundary.connect(speech_synthesizer_word_boundary_cb)
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()
# display text one by one
def speech_synthesizer_word_boundary_cb(evt: speechsdk.SessionEventArgs):
    global text_display
    if not (evt.boundary_type == speechsdk.SpeechSynthesisBoundaryType.Sentence):
        text_result = evt.text
        text_display = text_display + " " + text_result
        trans.config(text = text_display)
    if evt.text == ".":
        text_display = ""
# openai
def askOpenAI(question):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages = question
    )
    return completion['choices'][0]['message']['content']
# button event callbacks
def button_click1():
    global flag
    flag = 1

def button_click2():
    global flag
    flag = 3
u_gui=GUI()
u_audio = Audio()
# speech service configuration
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
speech_config.speech_synthesis_language = 'en-US'
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
# GUI initialization
img1=u_gui.draw_image(image="background.jpg",x=0,y=0,w=240)
button=u_gui.draw_image(image="mic.jpg",x=13,y=240,h=60,onclick=button_click1)
refresh=u_gui.draw_image(image="refresh.jpg",x=157,y=240,h=60,onclick=button_click2)
init=u_gui.draw_text(text="Tap to speak",x=27,y=50,font_size=15, color="#00CCCC")
trans=u_gui.draw_text(text="",x=2,y=0,font_size=12, color="#000000")
trans.config(w=230)
result = ""
flag = 0
text_display = ""
message = [{"role": "system", "content": "You are a helpful assistant."}]
user = {"role": "user", "content": ""}
assistant = {"role": "assistant", "content": ""}
while True:
    if (flag == 0):
        button.config(image="mic.jpg",state="normal")
        refresh.config(image="refresh.jpg",state="normal")
    if (flag == 3):
        message.clear()
        message = [{"role": "system", "content": "You are a helpful assistant."}]
    if (flag == 2):
        azure_synthesis_result = askOpenAI(message)
        assistant["content"] = azure_synthesis_result
        message.append(assistant.copy())
        tts(azure_synthesis_result)
        time.sleep(1)
        flag = 0
        trans.config(text=" ")
        button.config(image="",state="normal")
        refresh.config(image="",state="normal")
        init.config(x=15)
    if (flag == 1):
        init.config(x=600)
        trans.config(text="Listening...")
        button.config(image="",state="disable")
        refresh.config(image="",state="disable")
        result = recognize_from_microphone()
        user["content"] = result
        message.append(user.copy())
        trans.config(text=result)
        time.sleep(2)
        trans.config(text="Thinking...")
        flag = 2
Building on the technical path above, OpenAI's cross-modal capabilities further strengthen its ecosystem, allowing developers to build OpenAI-based applications more quickly. In terms of single-modality performance, the recently updated DALL·E 3 is not inferior to the previously leading Midjourney and Stable Diffusion in visual quality. Combining vision capabilities, GPT-4, text-to-speech (TTS), and the Copilot partnership with Microsoft, this cross-modal integration will greatly simplify realizing complex logic and task execution through natural language. GPT-4 has also received a major update: the new GPT-4 Turbo supports uploading external databases or files, handles context lengths of up to 128k tokens (equivalent to roughly a 300-page book), has a knowledge cutoff updated to April 2023, and comes with heavily discounted API prices.
II. The second method: OpenAI handles it all (GPT + Whisper + TTS)
With this API integration, we can try using only OpenAI's APIs to implement the same functionality. Combining OpenAI's Whisper, GPT, and TTS achieves the features described above. The advantage is that you only need to register with OpenAI and obtain a single key, and the spoken language can be identified automatically. However, OpenAI's Whisper does not currently support real-time transcription, so the code structure and program responsiveness differ somewhat.
- Initialize and record audio.
- Use the Whisper model to transcribe the recorded speech to text: the transcriptions API takes as input the audio file to transcribe and the desired output format. Multiple input and output formats are supported; file uploads are currently limited to 25 MB and the following input types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
- Use the gpt-3.5-turbo model to generate the answer.
- Use the tts-1 model to convert the answer into speech and write an audio file.
- Play the audio file.
import openai
import pyaudio
import wave
import time
import os
openai.api_key="xxxxxxxx" # input your openai key
def record_and_convert():
    # Define recording parameters
    CHUNK = 1024                         # Number of frames per buffer
    FORMAT = pyaudio.paInt16             # Data format
    CHANNELS = 1                         # Number of channels
    RATE = 44100                         # Sampling rate
    RECORD_SECONDS = 10                  # Recording time
    WAVE_OUTPUT_FILENAME = "output.wav"  # Output file name
    # Initialize pyaudio
    p = pyaudio.PyAudio()
    # Open recording stream
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    print("Start recording, please speak...")
    frames = []
    # Record audio data
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
    print("Recording ends!")
    # Stop recording
    stream.stop_stream()
    stream.close()
    p.terminate()
    # Save recording data to file
    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()
    audio_file = open("output.wav", "rb")
    transcript = openai.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text"
    )
    print("Transcription completed")
    input_txt = transcript
    response_text = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user",
             "content": input_txt}
        ]
    )
    # print(response_text.choices[0].message)
    input_tts = response_text.choices[0].message.content
    # Convert to speech
    print("Start converting speech")
    response = openai.audio.speech.create(
        model="tts-1",
        voice="shimmer",
        input=input_tts,
    )
    response.stream_to_file("output2.mp3")
    # Play the speech
    os.system("play output2.mp3")

if __name__ == "__main__":
    while True:
        record_and_convert()
        time.sleep(1)
Through the above two methods, we implemented the GPT voice-agent assistant. The integration of the Azure Speech API and OpenAI GPT has opened up a new frontier for developing intelligent desktop assistants. Advances in natural language processing and speech recognition are making our interactions with computers more natural and efficient. By harnessing these technologies, we can build applications that perform complex tasks and provide expert-level services in their respective fields. In upcoming posts, I will continue to share the development of this intelligent assistant and explore more possibilities that this integration can bring.
GPTs and rich APIs enable us to easily develop and deploy personalized intelligent agents. An intelligent agent can be understood as a program that simulates human intelligent behavior when interacting with its environment; the control system of a self-driving car is one example. At the developer conference, OpenAI staff showed a demo: upload a PDF of flight information, and within seconds the agent sorts out the ticket details and displays them on a web page. Combined with more hardware interfaces, we can try to build custom GPT applications that are not tied to a computer or mobile phone screen. Just like this smart desktop assistant, more physical-level intelligent controls can be added in the future to achieve a more natural 'intelligent agent'.
The future we once envisioned for AI agents has now become a reality.