This project uses NVIDIA Riva and the OpenAI API, deployed on an NVIDIA Jetson device, to build an interactive chatbot. It has the following features:
- Wake word detection
- Continuous conversation
First, the speech input from the microphone is converted into text using Riva's Automatic Speech Recognition (ASR) library and passed to the OpenAI API. When the OpenAI API returns a result, the text is converted into speech using Riva's Text-to-Speech (TTS) library and played through the speaker.
Riva is a speech processing platform developed by NVIDIA that helps developers build powerful speech applications. It offers various speech processing capabilities, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), Natural Language Processing (NLP), and Neural Machine Translation (NMT). It utilizes NVIDIA's GPU acceleration technology, ensuring high performance even under heavy workloads, and provides user-friendly API interfaces and SDK tools, making it easy for developers to build speech applications. Riva offers pretrained speech models in NVIDIA NGC™ that can be fine-tuned with NVIDIA NeMo on a custom data set, accelerating the development of domain-specific models by 10x.
What is ASR in Riva?
Riva's ASR (Automatic Speech Recognition) is an advanced technology developed by NVIDIA. It accurately converts spoken language into written text using deep learning models and algorithms. It is widely used for real-time transcription, voice commands, and other speech-to-text applications.
What is TTS in Riva?
Riva's TTS (Text-to-Speech) is an advanced technology that generates high-quality, natural-sounding speech from written text. It uses deep learning techniques to produce human-like speech with accurate pronunciation and expression. Developers can customize parameters to achieve desired voice characteristics. It is used in applications like virtual assistants, audiobooks, and accessibility solutions.
What is the OpenAI API?
The OpenAI API enables developers to integrate advanced natural language processing capabilities into their applications. It provides access to powerful language models that can generate human-like text based on prompts. Developers can make requests to the API, receiving generated text as output for tasks like text generation, translation, and more.
Where will we deploy it?
We will deploy the above solution on a reComputer J4012, which is built with the Jetson Orin NX 16GB: a powerful and compact intelligent edge box that brings up to 100 TOPS of modern AI performance to the edge, offering up to 5X the performance of Jetson Xavier NX and up to 3X the performance of Jetson AGX Xavier. Combining the NVIDIA Ampere™ GPU architecture with 64-bit operating capability, Orin NX integrates advanced multi-function video and image processing and NVIDIA Deep Learning Accelerators.
Step 1 Flash JetPack
First of all, you should flash JetPack onto the device. More details are available here.
Step 2 Look for the adapted Riva version
In this case, we use Riva 2.11.0 on an embedded platform, which requires JetPack 5.1 or JetPack 5.1.1. You can refer to the Support Matrix. Before deploying, make sure that:
- You have ~15 GB of free disk space on the Jetson, as required by the default containers and models. If you are deploying any Riva model intermediate representation (RMIR) models, the additional disk space required is ~14 GB plus the size of the RMIR models.
- You have enabled the following power modes on the Jetson platform. These modes activate all CPU cores and clock the CPU/GPU at maximum frequency to achieve the best performance.
sudo nvpmodel -m 0 (Jetson Orin AGX, mode MAXN)
sudo nvpmodel -m 0 (Jetson Xavier AGX, mode MAXN)
sudo nvpmodel -m 2 (Jetson Xavier NX, mode MODE_15W_6CORE)
- You have set the default runtime to nvidia on the Jetson platform by adding the following line to the /etc/docker/daemon.json file, then restarted the Docker service with sudo systemctl restart docker:
"default-runtime": "nvidia"
Step 3 Install CLI tools
Open a terminal and enter:
wget --content-disposition https://ngc.nvidia.com/downloads/ngccli_arm64.zip && unzip ngccli_arm64.zip && chmod u+x ngc-cli/ngc
Check the binary's md5 hash to ensure the file wasn't corrupted during download:
find ngc-cli/ -type f -exec md5sum {} + | LC_ALL=C sort | md5sum -c ngc-cli.md5
Add your current directory to path:
echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
Or create a symlink:
ln -s $(pwd)/ngc-cli/ngc [destination_path]/ngc
You must configure NGC CLI for your use so that you can run the commands.
Enter the following command, including your API key when prompted:
ngc config set
Step 4 Local Deployment Using Quick Start Scripts
Use the NGC CLI tool to download the quick start scripts from the command line:
ngc registry resource download-version nvidia/riva/riva_quickstart_arm64:2.11.0
Initialize and start Riva. The initialization step downloads and prepares Docker images and models. The start script launches the server.
Modify the config.sh file within the quickstart directory with the configurations below. In the example, service_enabled_tts and service_enabled_asr are set to true, which enables the TTS and ASR services, while service_enabled_nmt and service_enabled_nlp are set to false, which disables those services.
# Enable or Disable Riva Services
service_enabled_tts=true
service_enabled_asr=true
service_enabled_nmt=false
service_enabled_nlp=false
The models are installed and configured if they are uncommented in config.sh and the corresponding service is enabled.
Once the configuration is complete, enter the following command:
cd riva_quickstart_arm64_v2.11.0
Initialize and start Riva:
bash riva_init.sh
bash riva_start.sh
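When you later need to shut the server down, the quick start directory also ships a companion stop script:
bash riva_stop.sh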
How to run Riva's ASR?
Set your input device and sample rate (the default is 16000). You can check the input devices using the following command:
python3 transcribe_mic.py --list-devices
Then run scripts/asr/transcribe_mic.py:
python3 transcribe_mic.py --input-device <device_number> --sample-rate-hz <sample_rate>
Now when you talk into the microphone, the speech will be converted to text and displayed on the terminal.
How to run Riva's TTS?
Set your output device and sample rate (the default is 44100). You can check the output devices using the following command:
python3 talk.py --list-devices
Then run scripts/tts/talk.py:
python3 talk.py --output-device <device_number> --sample-rate-hz <sample_rate>
Now when you type text on the terminal, it will be converted to speech and spoken through the speaker.
How to use the OpenAI API?
First, sign in to your OpenAI account, and then visit this page to create an API key.
Next, install the OpenAI Python package using the following command in your terminal:
pip3 install openai
After that, we call openai.ChatCompletion.create(). The code below is an example of using the OpenAI API chat feature. Create a new Python script (here we use VS Code; you can refer here for more detail) and run the following code:
import openai

openai.api_key = "openai-api-key"  # use your own OpenAI API key here
model_engine = "gpt-3.5-turbo"
ans = openai.ChatCompletion.create(
    model=model_engine,
    messages=[{"role": "user", "content": "your question"},
              {"role": "assistant", "content": "The answer to the previous question"}]  # use "assistant" messages to maintain context
)
print(ans.choices[0].message)
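The printed message object contains both a role and a content field; to extract just the reply text, read ans.choices[0].message["content"].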
The main input is the messages parameter. Messages must be an array of message objects, where each object has a role (either "system", "user", or "assistant") and content (the content of the message). Conversations can be as short as 1 message or fill many pages.
assistant: messages with the "assistant" role store the model's previous replies. This sustains the conversation and provides context for subsequent questions.
You can refer to this for more details.
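To keep a continuous conversation going, append each reply to the message list before the next request. Below is a minimal sketch of this pattern; the history list and chat() helper are our own illustration, not part of the OpenAI SDK:
import openai

openai.api_key = "openai-api-key"  # use your own OpenAI API key here
history = []  # accumulates the conversation turn by turn

def chat(user_text):
    history.append({"role": "user", "content": user_text})
    ans = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
    reply = ans.choices[0].message["content"]
    history.append({"role": "assistant", "content": reply})  # keep context for the next turn
    return reply

print(chat("Hello!"))
print(chat("What did I just ask you?"))  # the model can now refer back to the first turn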
Overview of the key code
This section shows the key code for speech-to-text, text-to-speech, and wake-up settings. For the complete code, please refer to the end of the document.
How can we get the transcribed text from the microphone?
Open the microphone stream as an iterator, then iterate over each response in asr_service.streaming_response_generator(). Use is_final to determine whether a sentence is finished.
with riva.client.audio_io.MicrophoneStream(
    args.sample_rate_hz,
    args.file_streaming_chunk,
    device=args.input_device,
) as stream:
    for response in asr_service.streaming_response_generator(
        audio_chunks=stream,
        streaming_config=config,
    ):
        for result in response.results:
            if result.is_final:  # the sentence is complete
                transcripts = result.alternatives[0].transcript
                output = transcripts
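For reference, the auth, asr_service, config, and args objects used above can be created along these lines (a minimal sketch assuming a local Riva server on the default port; adjust the device number, chunk size, and sample rate to your hardware):
import argparse
import riva.client
import riva.client.audio_io

auth = riva.client.Auth(uri='localhost:50051')  # assumes the Riva server runs locally
asr_service = riva.client.ASRService(auth)

config = riva.client.StreamingRecognitionConfig(
    config=riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        language_code='en-US',
        max_alternatives=1,
        sample_rate_hertz=16000,
    ),
    interim_results=True,
)

args = argparse.Namespace()
args.sample_rate_hz = 16000
args.file_streaming_chunk = 1600  # frames per chunk read from the microphone
args.input_device = 24            # replace with your own input device number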
How do we turn text into speech output?
Set the parameters (sample_rate_hz and output_device):
args1 = argparse.Namespace()
args1.language_code = 'en-US'
args1.sample_rate_hz = 48000  # check the sample rate of your own device and replace this
args1.stream = True  # this should be True
args1.output_device = 24  # check the port number of your own device and replace this
service = riva.client.SpeechSynthesisService(auth)  # requests the Riva server to synthesize the speech
nchannels = 1
sampwidth = 2
sound_stream = None
We call the riva.client.audio_io.SoundCallBack() function to create a sound stream, then call service.synthesize_online() to synthesize the voice.
try:
    if args1.output_device is not None:
        # To play audio during synthesis, pass audio chunks to riva.client.audio_io.SoundCallBack as they arrive.
        sound_stream = riva.client.audio_io.SoundCallBack(
            args1.output_device, nchannels=nchannels, sampwidth=sampwidth,
            framerate=args1.sample_rate_hz
        )
    if args1.stream:
        # responses1 is the synthesized speech, returned as an iterator
        responses1 = service.synthesize_online(
            answer, None, args1.language_code, sample_rate_hz=args1.sample_rate_hz
        )
        # Play the speech iteratively as chunks arrive
        for resp in responses1:
            if sound_stream is not None:
                sound_stream(resp.audio)
finally:
    if sound_stream is not None:
        sound_stream.close()
How to set wake words and sleep words?
You can modify this part of the code to set your own wake word and sleep word:
if output == "hello ":  # specify your wake word here, and remember to add a space after it
    is_wakeup = True
    anSwer('here', auth)
    output = ""
if output == "stop " and is_wakeup == True:  # specify your sleep word here, and remember to add a space after it
    is_wakeup = False
    anSwer('Bye! Have a great day!', auth)
    output = ""
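The anSwer() helper is not shown in this excerpt; judging from the calls above, it synthesizes a text string with Riva TTS and plays it, along the lines of the TTS code earlier. A possible reconstruction is sketched below (an assumption on our part; see the complete code at the end of the document for the actual implementation):
def anSwer(text, auth):
    # Hypothetical reconstruction: speak `text` through Riva TTS,
    # reusing the args1 settings defined in the TTS section above.
    service = riva.client.SpeechSynthesisService(auth)
    sound_stream = riva.client.audio_io.SoundCallBack(
        args1.output_device, nchannels=1, sampwidth=2, framerate=args1.sample_rate_hz
    )
    try:
        for resp in service.synthesize_online(
            text, None, args1.language_code, sample_rate_hz=args1.sample_rate_hz
        ):
            sound_stream(resp.audio)
    finally:
        sound_stream.close()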
Run Code
The input-device and sample-rate-hz parameters should be replaced with your own values:
python3 <yourfilename.py> --input-device 24 --sample-rate-hz 48000
Results