This project uses NVIDIA Riva and the OpenAI API, deployed on an NVIDIA Jetson device, to build an interactive chatbot. It has the following features:
- Wake word detection
- Continuous conversation
First, the speech input from the microphone is converted into text using Riva's Automatic Speech Recognition (ASR) library and passed to the OpenAI API. When the OpenAI API returns a result, the text is converted into speech using Riva's Text-to-Speech (TTS) library and played through the speaker.
Riva is a speech processing platform developed by NVIDIA that helps developers build powerful speech applications. It offers various speech processing capabilities, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), Natural Language Processing (NLP), and Neural Machine Translation (NMT). It utilizes NVIDIA's GPU acceleration technology, ensuring high performance even under heavy workloads, and provides user-friendly API interfaces and SDK tools, making it easy for developers to build speech applications. Riva offers pretrained speech models in NVIDIA NGC™ that can be fine-tuned with NVIDIA NeMo on a custom data set, accelerating the development of domain-specific models by 10x.
What is ASR in Riva?
Riva's ASR (Automatic Speech Recognition) is an advanced technology developed by NVIDIA. It accurately converts spoken language into written text using deep learning models and algorithms. It is widely used for real-time transcription, voice commands, and other speech-to-text applications.
What is TTS in Riva?
Riva's TTS (Text-to-Speech) is an advanced technology that generates high-quality, natural-sounding speech from written text. It uses deep learning techniques to produce human-like speech with accurate pronunciation and expression. Developers can customize parameters to achieve desired voice characteristics. It is used in applications like virtual assistants, audiobooks, and accessibility solutions.
What is the OpenAI API?
The OpenAI API enables developers to integrate advanced natural language processing capabilities into their applications. It provides access to powerful language models that can generate human-like text based on prompts. Developers can make requests to the API, receiving generated text as output for tasks like text generation, translation, and more.
Where will we deploy it?
We will deploy the above solution on a reComputer J4012, which is built with the Jetson Orin NX 16GB: a powerful and compact intelligent edge box that brings up to 100 TOPS of modern AI performance to the edge, offering up to 5X the performance of Jetson Xavier NX and up to 3X the performance of Jetson AGX Xavier. Combining the NVIDIA Ampere™ GPU architecture with 64-bit operating capability, Orin NX integrates advanced multi-function video and image processing and NVIDIA Deep Learning Accelerators.
Step 1 Flash JetPack
First of all, you should flash JetPack onto the device. More details are available here.
Step 2 Look for the adapted Riva version
In this case, we use Riva 2.11.0 on an embedded platform, which requires JetPack 5.1 or JetPack 5.1.1. You can refer to the Support Matrix. Before deploying, make sure that:
- You have ~15 GB of free disk space on the Jetson, as required by the default containers and models. If you are deploying any Riva model intermediate representation (RMIR) models, the additional disk space required is ~14 GB plus the size of the RMIR models.
- You have enabled the following power modes on the Jetson platform. These modes activate all CPU cores and clock the CPU/GPU at maximum frequency to achieve the best performance.
sudo nvpmodel -m 0 (Jetson Orin AGX, mode MAXN)
sudo nvpmodel -m 0 (Jetson Xavier AGX, mode MAXN)
sudo nvpmodel -m 2 (Jetson Xavier NX, mode MODE_15W_6CORE)
- You have set the default runtime to nvidia on the Jetson platform by adding the following line to the /etc/docker/daemon.json file, then restarted the Docker service with sudo systemctl restart docker:
"default-runtime": "nvidia"
Step 3 Install CLI tools
Open a terminal and enter:
wget --content-disposition https://ngc.nvidia.com/downloads/ngccli_arm64.zip && unzip ngccli_arm64.zip && chmod u+x ngc-cli/ngc
Check the binary's md5 hash to ensure the file wasn't corrupted during download:
find ngc-cli/ -type f -exec md5sum {} + | LC_ALL=C sort | md5sum -c ngc-cli.md5
Add your current directory to path:
echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
Or create a symlink:
ln -s $(pwd)/ngc-cli/ngc [destination_path]/ngc
You must configure NGC CLI for your use so that you can run the commands.
Enter the following command, including your API key when prompted:
ngc config set
Step 4 Local Deployment Using Quick Start Scripts
Use the NGC CLI tool to download the quick start scripts from the command line:
ngc registry resource download-version nvidia/riva/riva_quickstart_arm64:2.11.0
Initialize and start Riva. The initialization step downloads and prepares Docker images and models. The start script launches the server.
Modify the config.sh file within the quickstart directory with the configurations below. In the example, service_enabled_tts and service_enabled_asr are set to true, which enables the TTS and ASR services, while service_enabled_nmt and service_enabled_nlp are set to false, which disables those services.
# Enable or Disable Riva Services
service_enabled_tts=true
service_enabled_asr=true
service_enabled_nmt=false
service_enabled_nlp=false
The models are installed and configured if they are uncommented in config.sh and the corresponding service is enabled.
Once the configuration is complete, enter the following command:
cd riva_quickstart_arm64_v2.11.0
Initialize and start Riva:
bash riva_init.sh
bash riva_start.sh
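When you later need to shut the server down, the quick start directory also ships a companion stop script:
bash riva_stop.sh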
How to run Riva's ASR?
Set your input device and sample rate (the default is 16000). You can check the input devices using the following command:
python3 transcribe_mic.py --list-devices
Then run scripts/asr/transcribe_mic.py:
python3 transcribe_mic.py --input-device <device_number> --sample-rate-hz <sample_rate>
Now when you talk into the microphone, the speech will be converted to text and displayed on the terminal.
How to run Riva's TTS?
Set your output device and sample rate (the default is 44100). You can check the output devices using the following command:
python3 talk.py --list-devices
Then run scripts/tts/talk.py:
python3 talk.py --output-device <device_number> --sample-rate-hz <sample_rate>
Now when you type text on the terminal, it will be converted to speech and spoken through the speaker.
How to use the OpenAI API?
First, sign in to your OpenAI account, and then visit this page to create an API key.
Next, install the OpenAI Python package using the following command in your terminal:
pip3 install openai
After that, we call openai.ChatCompletion.create(). The code below is an example of using the OpenAI API chat feature. Create a new Python script (here we use VS Code; you can refer here for more detail) and run the following code:
import openai

openai.api_key = "openai-api-key"  # use your own OpenAI API key here
model_engine = "gpt-3.5-turbo"
ans = openai.ChatCompletion.create(
    model=model_engine,
    messages=[{"role": "user", "content": "your question"},
              {"role": "assistant", "content": "The answer to the previous question"}]  # use "assistant" messages to maintain context
)
print(ans.choices[0].message)
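The printed message object contains both a role and a content field; to extract just the reply text, read ans.choices[0].message["content"].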
The main input is the messages parameter. Messages must be an array of message objects, where each object has a role (either "system", "user", or "assistant") and content (the content of the message). Conversations can be as short as 1 message or fill many pages.
assistant: messages with the "assistant" role store the model's previous replies. This sustains the conversation and provides context for subsequent questions.
You can refer to this for more details.
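To keep a continuous conversation going, append each reply to the message list before the next request. Below is a minimal sketch of this pattern; the history list and chat() helper are our own illustration, not part of the OpenAI SDK:
import openai

openai.api_key = "openai-api-key"  # use your own OpenAI API key here
history = []  # accumulates the conversation turn by turn

def chat(user_text):
    history.append({"role": "user", "content": user_text})
    ans = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
    reply = ans.choices[0].message["content"]
    history.append({"role": "assistant", "content": reply})  # keep context for the next turn
    return reply

print(chat("Hello!"))
print(chat("What did I just ask you?"))  # the model can now refer back to the first turn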
Overview of the key code
This section shows the key code for speech-to-text, text-to-speech, and wake-up settings. For the complete code, please refer to the end of the document.
How can we get the transcribed text from the microphone?
Open the microphone stream as an iterator, then iterate over each response in asr_service.streaming_response_generator(). Use is_final to determine whether a sentence is finished.
with riva.client.audio_io.MicrophoneStream(
    args.sample_rate_hz,
    args.file_streaming_chunk,
    device=args.input_device,
) as stream:
    for response in asr_service.streaming_response_generator(
        audio_chunks=stream,
        streaming_config=config,
    ):
        for result in response.results:
            if result.is_final:  # the sentence is complete
                transcripts = result.alternatives[0].transcript
                output = transcripts
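For reference, the auth, asr_service, config, and args objects used above can be created along these lines (a minimal sketch assuming a local Riva server on the default port; adjust the device number, chunk size, and sample rate to your hardware):
import argparse
import riva.client
import riva.client.audio_io

auth = riva.client.Auth(uri='localhost:50051')  # assumes the Riva server runs locally
asr_service = riva.client.ASRService(auth)

config = riva.client.StreamingRecognitionConfig(
    config=riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        language_code='en-US',
        max_alternatives=1,
        sample_rate_hertz=16000,
    ),
    interim_results=True,
)

args = argparse.Namespace()
args.sample_rate_hz = 16000
args.file_streaming_chunk = 1600  # frames per chunk read from the microphone
args.input_device = 24            # replace with your own input device number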
How do we turn text into speech output?
Set the parameters (sample_rate_hz and output_device):
args1 = argparse.Namespace()
args1.language_code = 'en-US'
args1.sample_rate_hz = 48000  # check the sample rate of your own device and replace this
args1.stream = True  # this should be True
args1.output_device = 24  # check the port number of your own device and replace this
service = riva.client.SpeechSynthesisService(auth)  # requests the Riva server to synthesize the speech
nchannels = 1
sampwidth = 2
sound_stream = None
We call the riva.client.audio_io.SoundCallBack() function to create a sound stream, then call service.synthesize_online() to synthesize the voice.
try:
    if args1.output_device is not None:
        # To play audio during synthesis, pass audio chunks to riva.client.audio_io.SoundCallBack as they arrive.
        sound_stream = riva.client.audio_io.SoundCallBack(
            args1.output_device, nchannels=nchannels, sampwidth=sampwidth,
            framerate=args1.sample_rate_hz
        )
    if args1.stream:
        # responses1 is the synthesized speech, returned as an iterator
        responses1 = service.synthesize_online(
            answer, None, args1.language_code, sample_rate_hz=args1.sample_rate_hz
        )
        # Play the speech iteratively as chunks arrive
        for resp in responses1:
            if sound_stream is not None:
                sound_stream(resp.audio)
finally:
    if sound_stream is not None:
        sound_stream.close()
How to set wake words and sleep words?
You can modify this part of the code to set your own wake word and sleep word:
if output == "hello ":  # specify your wake word here, and remember to add a space after it
    is_wakeup = True
    anSwer('here', auth)
    output = ""
if output == "stop " and is_wakeup == True:  # specify your sleep word here, and remember to add a space after it
    is_wakeup = False
    anSwer('Bye! Have a great day!', auth)
    output = ""
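The anSwer() helper is not shown in this excerpt; judging from the calls above, it synthesizes a text string with Riva TTS and plays it, along the lines of the TTS code earlier. A possible reconstruction is sketched below (an assumption on our part; see the complete code at the end of the document for the actual implementation):
def anSwer(text, auth):
    # Hypothetical reconstruction: speak `text` through Riva TTS,
    # reusing the args1 settings defined in the TTS section above.
    service = riva.client.SpeechSynthesisService(auth)
    sound_stream = riva.client.audio_io.SoundCallBack(
        args1.output_device, nchannels=1, sampwidth=2, framerate=args1.sample_rate_hz
    )
    try:
        for resp in service.synthesize_online(
            text, None, args1.language_code, sample_rate_hz=args1.sample_rate_hz
        ):
            sound_stream(resp.audio)
    finally:
        sound_stream.close()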
Run Code
The input-device and sample-rate-hz parameters should be replaced with your own values:
python3 <yourfilename.py> --input-device 24 --sample-rate-hz 48000
Results