To develop a voice assistant, we usually have two options to consider: using a local large language model (LLM) together with local speech-to-text and text-to-speech libraries, or using an online speech service for speech-to-text and text-to-speech along with an online LLM such as Azure OpenAI's GPT-4 or GPT-3.5. Here are the advantages and disadvantages of both options:
Option 1: Local Speech Service and LLM
Advantages:
Data privacy: by processing data locally, we can ensure better privacy and security for users' information.
No reliance on Internet connectivity: the voice assistant will work even without an active Internet connection, which may be beneficial in certain circumstances or locations.
No API costs: As we are not using any external APIs, there are no additional costs associated with API usage.
Disadvantages:
Resource constraints: large models require significant computational resources, which may not be available on all devices, leading to limited performance or compatibility.
Maintenance: we will be responsible for keeping the model and libraries updated, which may require additional time and effort.
Limited scalability: local processing could limit the number of users or simultaneous requests that the voice assistant can handle.
Option 2: Online Speech Service and LLM
Advantages:
Scalability: online services are designed to scale with demand, enabling the voice assistant to handle more users and requests concurrently.
Lower resource requirements: by offloading processing to the cloud, we can reduce the computational requirements for users' devices.
Automatic updates: online services are typically updated and maintained by their owners, which means we do not need to worry about keeping models and libraries up to date.
Disadvantages:
Internet dependency: the voice assistant will require an active Internet connection to function, which may not be available or reliable in all situations.
Data privacy concerns: by using cloud services, user data is transmitted and processed externally, which may raise privacy and security concerns for some users.
API costs: utilizing online services may incur additional costs based on usage, which could impact the project's budget.
Hence, choosing between these two options depends on priorities such as data privacy, scalability, resource requirements, and cost.
2. Considerations for Running LLMs on NVIDIA's Jetson Series of Development Boards
Developing a personal voice assistant that incorporates speech-to-text, a large language model (LLM), and text-to-speech functionality requires significant computational resources, especially for the LLM component. NVIDIA's Jetson series of development boards is designed for edge computing applications, including AI and machine learning tasks. However, the boards' capability to run large language models varies significantly across the series. Here is an analysis from multiple perspectives, including hardware, software, and other factors.
2.1. Hardware Capabilities
Jetson Nano: The Jetson Nano is the entry-level board in the series. With a 128-core Maxwell GPU and 4 GB of RAM, it's suitable for basic AI tasks and prototypes but lacks the computational power for running large language models efficiently in real time. It could potentially run smaller models or simplified versions with considerable latency.
Jetson TX2: A step up, the TX2 features a 256-core Pascal GPU and up to 8 GB of RAM. While more powerful than the Nano, it still faces limitations for extensive LLM tasks. It's more suited to mid-level AI applications and could manage smaller LLMs with optimizations.
Jetson Xavier NX: This board strikes a balance between performance and power consumption. With a 384-core Volta GPU and 8 GB of RAM, it's capable of handling more demanding tasks. For LLMs, lightweight models or distillations might run effectively, especially with optimization and quantization techniques.
Jetson AGX Xavier: The AGX Xavier is a significant leap forward, offering a 512-core Volta GPU and 16-32 GB of RAM. It's designed for high-performance edge AI applications and could potentially run certain large language models, albeit with careful management of resources and model optimizations.
Jetson AGX Orin: The latest and most powerful board in the series, featuring an Ampere-architecture GPU with up to 2048 cores and up to 64 GB of RAM. This board is the most likely candidate to run large language models locally, thanks to its substantial computational power and memory capacity. However, even here, efficiency and latency would depend on the specific demands of the LLM and the optimization of the model.
2.2. Software and Model Optimization
Running LLMs efficiently on Jetson boards requires model optimization. Techniques like quantization, pruning, and knowledge distillation can significantly reduce the model size and computational requirements, making it more feasible to deploy LLMs on edge devices.
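To illustrate what quantization makes possible on constrained hardware, here is a minimal sketch of loading and querying a 4-bit quantized model with the llama-cpp-python library. The library choice, the GGUF model file name, and the parameters are assumptions for illustration only; they are not part of this project, which uses cloud services instead.

# Minimal sketch: running a small 4-bit quantized model with llama-cpp-python.
# The model file name and parameters below are assumptions, not part of this project.
from llama_cpp import Llama

llm = Llama(
    model_path="tinyllama-1.1b-chat.Q4_K_M.gguf",  # a small quantized GGUF model
    n_ctx=1024,       # modest context window to limit memory use
    n_threads=4,      # match the board's CPU core count
)

output = llm("Q: What is edge computing? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"].strip())

Even with a setup like this, the smaller Jetson boards would likely only handle compact models with noticeable latency, which is why the rest of this tutorial offloads the LLM to the cloud.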
2.3. Use Case and Performance Expectations
The feasibility also depends on the specific use case and performance expectations. For applications requiring real-time interaction, such as a voice assistant, latency becomes a critical factor. Smaller, optimized models may provide a balance between performance and responsiveness.
2.4. External Factors
Consideration should also be given to external factors such as power consumption, thermal management, and the availability of supporting software and tools for model optimization and deployment.
In summary, while the Jetson Nano and TX2 might struggle with real-time LLM applications due to their limited computational resources, the Xavier NX offers a middle ground for optimized models. The AGX Xavier and especially the AGX Orin present the most viable options for running LLMs locally, with the latter offering the highest potential. However, success in deploying LLMs on these devices depends heavily on model optimization and the specific requirements of the application.
According to the above analysis, it is quite difficult to run an LLM locally on the Jetson Nano because its computational resources are very limited. In this tutorial, we will build an LLM-based voice assistant on the Jetson Nano using the Azure Speech and Azure OpenAI services.
The Jetson Nano Developer Kit we use is the 4 GB memory version, and the JetPack version is 4.4.1, as shown in Figure 1.
Since the Jetson Nano is not equipped with an on-board speaker or microphone, we have to add them ourselves. There are two options: buy a separate USB speaker and USB microphone, or buy a single USB device that includes both. In this project, we use a MAXHUB speakerphone, which supports connection via Bluetooth, a USB cable, or an audio cable. On the Jetson Nano, we can use the "lsusb" command to list the hardware connected to the device via the USB interface, as shown in Figure 2.
If the BM21 omnidirectional microphone is connected via USB, after entering the Ubuntu desktop, click Settings in the upper right corner to open the system settings, select Sound in the left column, and configure the system's microphone and speaker. Select the corresponding hardware devices under Input Device and Output Device respectively, as shown in Figure 3.
The default Python version on the Jetson Nano is 3.6.9. However, according to the Azure Speech SDK documentation, the minimum supported Python version is 3.7. Miniconda helps here: it is a small bootstrap version of Anaconda that includes only conda, Python, the packages they both depend on, and a small number of other useful packages (such as pip and zlib). In this project, we install the Miniconda3 py37_4.9.2 Linux-aarch64 64-bit version. After installation, we implement the project in a virtual Python environment. If Miniconda is successfully installed on the Jetson Nano, we enter the base virtual environment by default whenever we open a terminal, and we can use the "pip list" command to see the installed Python packages.
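As a quick convenience check (not part of the project code), we can confirm that the active interpreter in the conda environment meets the SDK's 3.7 requirement:

# Sanity check that the active interpreter is Python 3.7 or newer,
# as required by the Azure Speech SDK.
import sys

if sys.version_info < (3, 7):
    raise RuntimeError("Python 3.7+ is required, found " + sys.version.split()[0])
print("Python version OK:", sys.version.split()[0])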
In the virtual environment, we can use the pip tool to install the Azure Speech and OpenAI SDKs:
pip install azure-cognitiveservices-speech
pip install openai
We can use "pip show" to view detailed information about the packages, as shown in Figure 5.
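To make sure both SDKs were installed into the conda environment rather than the system Python, a short import check can be run from the same terminal. This is only a convenience sketch, not part of the project code:

# Verify that both SDKs import correctly in the active conda environment.
import azure.cognitiveservices.speech as speechsdk
import openai

print("Azure Speech SDK imported:", speechsdk.__name__)
print("OpenAI SDK imported:", openai.__name__)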
However, if we use the default system environment to install the azure-cognitiveservices-speech package, we will get a "No matching distribution found" error, as shown in Figure 6.
There are two ways to create an Azure Speech resource. One is to create an Azure AI Services resource, which bundles a variety of AI services such as Vision, Language, Search, and Speech into a single resource. The other is to create a standalone Azure Speech service, which has the benefit of including a free F0 tier, whereas Azure AI Services only offers the Standard S0 tier. When creating a new resource, search for Speech, select Azure Speech Services, and then complete the setup, as shown in Figure 7. Note that you should choose a region close to you; from a feature standpoint, East US, West Europe, and Southeast Asia support the most complete feature set.
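After the Speech resource is created, its key and region can be verified with a minimal one-shot recognition snippet. The key and region values below are placeholders, and the project's actual configuration is covered later:

# Minimal sketch: one-shot speech recognition with the Azure Speech SDK.
# Replace the placeholder key and region with the values from your Speech resource.
import azure.cognitiveservices.speech as speechsdk

speech_key = "<your-speech-key>"   # placeholder
service_region = "eastus"          # placeholder, use your resource's region

speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)  # default microphone

print("Say something...")
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized:", result.text)
else:
    print("Recognition did not succeed:", result.reason)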
This project uses the Azure OpenAI service to host a large language model and leverages that model to interact with users. Of course, users who already have access to OpenAI's own service can use it directly; the Azure OpenAI service is used here as an example. At present, access to the Azure OpenAI service requires an application (application address: https://aka.ms/oai/access). Once the application is approved, you can create the Azure OpenAI resource: navigate to the Azure portal, search for Azure OpenAI on the resource creation page, and click Create. You can refer to this document on MS Learn to create an OpenAI resource: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/create-resource. Note that the large language models come in multiple versions and different regions support different models, so the OpenAI resource should be created according to the specific model requirements. For details on the available models, check the documentation on MS Learn: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models. After the OpenAI resource is created, users can go to Azure OpenAI Studio to create and deploy models, as shown in Figure 8.
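Once a model is deployed, it can be called from Python. The sketch below assumes version 1.x of the openai package (older 0.x versions use a different interface, and the repository's code may differ); the endpoint, key, API version, and deployment name are placeholders:

# Minimal sketch: calling a deployed Azure OpenAI chat model (openai package 1.x).
# The endpoint, key, API version, and deployment name below are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<your-azure-openai-key>",                           # placeholder
    api_version="2024-02-01",                                    # placeholder API version
)

response = client.chat.completions.create(
    model="<your-deployment-name>",  # the deployment name from Azure OpenAI Studio
    messages=[
        {"role": "system", "content": "You are a helpful voice assistant."},
        {"role": "user", "content": "Hello, who are you?"},
    ],
)
print(response.choices[0].message.content)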
A keyword is a word or short phrase within a stream of audio. The most common use case of keyword recognition is voice activation of virtual assistants; for example, "Hey Cortana" is the keyword for the Cortana assistant. Because virtual assistants are generally always listening, keyword recognition acts as a privacy boundary for the user: the keyword requirement serves as a gate that prevents unrelated user audio from crossing from the local device to the cloud.
With the Custom Keyword portal on Speech Studio (https://speech.microsoft.com/customkeyword), we can generate keyword recognition models that execute at the edge by specifying any word or short phrase. We can further personalize the keyword model by choosing the right pronunciations. The most attractive point is that there is no cost to use Custom Keyword to generate models, for both Basic and Advanced models.
There is a tutorial that guides us through creating a keyword in Speech Studio (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/custom-keyword-basics?pivots=programming-language-python). It is provided for C#, C++, Go, Java, Python, Objective-C, and other languages. Before you can use a custom keyword, you need to create one on the Custom Keyword page in Speech Studio. After you provide a keyword, the portal produces a ".table" file that you can use with the Speech SDK. In this project, I created the custom keyword "小智" ("Xiao Zhi") in Chinese. We can check the created model on the Models page, as shown in Figure 9.
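The downloaded ".table" file is loaded at runtime through the Speech SDK's keyword recognition classes. The following is a minimal sketch of the typical pattern, with a placeholder file name; the project's own script wires this into the full assistant loop:

# Minimal sketch: waiting for a custom keyword with the Azure Speech SDK.
# "keyword_model.table" is a placeholder for the file downloaded from Speech Studio.
import azure.cognitiveservices.speech as speechsdk

keyword_model = speechsdk.KeywordRecognitionModel("keyword_model.table")
keyword_recognizer = speechsdk.KeywordRecognizer()  # uses the default microphone

print("Listening for the keyword...")
result = keyword_recognizer.recognize_once_async(keyword_model).get()
if result.reason == speechsdk.ResultReason.RecognizedKeyword:
    print("Keyword recognized:", result.text)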
The Python code is available in the GitHub repository: https://github.com/shijiong/GPT4VoiceAssistantForRaspberry. Download "VoiceAssistantKeyphrase.py" and modify the following configuration parameters (a sketch of this configuration section appears after the list):
azure_endpoint in line 8
api_key in line 9
api_version in line 10
speech_key in line 14
service_region in line 15
speech_config.speech_synthesis_language in line 18
speech_config.speech_recognition_language in line 20
speech_config.speech_synthesis_voice_name in line 23
For more information on supported languages, we can refer to this page: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt.
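For reference, the configuration section of VoiceAssistantKeyphrase.py follows roughly the pattern below. This is an approximation with placeholder values; the exact variable names, defaults, and surrounding code may differ in the repository.

# Approximate shape of the configuration section in VoiceAssistantKeyphrase.py.
# All values below are placeholders to be replaced with your own resources.
import azure.cognitiveservices.speech as speechsdk
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # azure_endpoint (line 8)
    api_key="<your-azure-openai-key>",                           # api_key (line 9)
    api_version="2024-02-01",                                    # api_version (line 10)
)

speech_key = "<your-speech-key>"   # speech_key (line 14)
service_region = "eastus"          # service_region (line 15)

speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
speech_config.speech_synthesis_language = "zh-CN"                   # line 18
speech_config.speech_recognition_language = "zh-CN"                 # line 20
speech_config.speech_synthesis_voice_name = "zh-CN-XiaoxiaoNeural"  # line 23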
10. Run the Code
If you have an IDE such as Visual Studio Code installed, you can open the source code. Configure the Python interpreter to use the Miniconda environment you created, as shown in Figure 10.
Now we can run or debug the application, as shown in Figure 11. The debug information will be printed in the Terminal window.