1 Introduction
This document chronicles the development of a versatile digital assistant designed to work across languages and platforms. The system converses fluently in several languages and can guide tourists through the streets of Macau, serve as a virtual host for live broadcasts on platforms such as Bilibili, and provide psychological-counseling support for users coping with anxiety and depression. Central to the design is an interaction model that is effortless and natural, with particular attention to accessibility for individuals with visual impairments. Our aim is to redefine the digital-assistance paradigm, ensuring inclusivity and convenience for all users.
2 System Architecture Overview
Building on the core functionalities of voice interaction through ASR and TTS, our system is enhanced with multilingual capabilities, sophisticated model management, and diverse front-end options. The system is detailed through two main diagrams, illustrating the complex interconnectivity and workflows.
2.1 Detailed System Architecture
2.1.1 Front-End Variability
The system supports multiple front-ends, including a web interface, hardware integration with Core S3, and robotics control through KR260. This allows for a broad range of applications from desktop interaction to mobile integration and robotics automation.
2.1.2 API Gateway
The API gateway is the middleware that routes requests, manages services such as ASR, Chat, and TTS, and interfaces with the model back-end and the database.
Multilingual Interaction
Our assistant is not limited to English. It supports Mandarin and Cantonese, which are essential for the Macau tourism application, switching seamlessly based on user input.
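As an illustrative sketch only, the language-routing idea can be shown as below. A production system would use a dedicated language-identification model, and Cantonese cannot be distinguished from Mandarin by script alone, so all names and the heuristic here are assumptions, not the project's actual code:

```python
def detect_language(text: str) -> str:
    """Rough script-based heuristic for illustration only: any CJK
    ideograph maps to "zh", everything else to "en"."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "zh"
    return "en"

def route_reply(text: str, handlers: dict):
    # Dispatch the utterance to a language-specific response pipeline
    return handlers[detect_language(text)](text)
```

In practice the detected language would also select the matching TTS voice, so the assistant answers in the language the user spoke.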
Session and Context Management
Each interaction is tracked with session IDs, preserving the history and context in the database. This ensures continuity and personalization in multi-turn dialogues.
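A minimal in-memory sketch of this session tracking is shown below. The real system persists history to the database, and `SessionStore` is an illustrative name, not the project's actual class:

```python
import uuid

class SessionStore:
    """Minimal in-memory session store keyed by session ID."""

    def __init__(self):
        self._sessions = {}

    def create(self) -> str:
        # Issue a fresh session ID with an empty history
        sid = uuid.uuid4().hex
        self._sessions[sid] = []
        return sid

    def append(self, sid: str, role: str, text: str) -> None:
        # Record one message (user or assistant) in the session history
        self._sessions[sid].append({"role": role, "content": text})

    def history(self, sid: str) -> list:
        # Return a copy so callers cannot mutate stored state
        return list(self._sessions[sid])
```

Because every request carries its session ID, the gateway can replay this history to the LLM and keep multi-turn dialogues coherent.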
2.1.3 Model Back-End
The Model Back-End houses the ASR, LLM, and TTS models and can integrate additional models for extended functionality. It also provides a customizable OpenAILike library for prompt tuning, and its websocket transmission capacity is continuously updated to keep communication channels robust.
2.1.4 Cloud and Hardware Acceleration
The assistant runs model computations on GPU clusters in the cloud, with AMD ROCm acceleration for the TTS models, and uses Docker for deploying and experimenting with FunASR technology.
2.2 Expanding on the Macau Tourism Digital Persona
The assistant is customized with a specialized prompt for Macau tourism, incorporating local knowledge and tourist information into its responses.
Interactive guidance is enhanced through rich ASR and TTS capabilities, providing tourists with an engaging way to explore Macau.
The system can be further extended to include visual recognition and processing, allowing it to interact with images and videos related to Macau’s tourist spots.
3 Technical Implementation
3.1 Model Back-End Implementation
Our back-end team integrated a suite of models crucial to the digital assistant's functionality. Drawing inspiration from AI Vtuber and Digital Life projects on GitHub, we incorporated models for ASR, LLM, TTS, and VLM/MLLM into our system, and deployed Aliyun's FunASR technology using Docker.
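As a rough sketch of such a Docker deployment — the image name, tag, and port below are illustrative assumptions, not the project's actual configuration:

```shell
# Run a FunASR runtime container in the background.
# "funasr-runtime:latest" is a hypothetical local image tag; the port
# is likewise an example, exposed for websocket ASR requests.
docker run -d \
  --name funasr-server \
  -p 10095:10095 \
  funasr-runtime:latest
```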
A significant breakthrough came with optimizing TTS processing. Initially the TTS workload ran on the CPU; it was moved to the GPU using CUDA acceleration, cutting TTS processing time to one to two seconds. The steps for this transition are as follows:
If using the CPU:
If using the GPU:
Install the environment:
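The device selection behind this CPU-to-GPU transition can be sketched with PyTorch as follows. This is an illustrative sketch rather than the project's actual code; note that ROCm builds of PyTorch expose the same `torch.cuda` interface, so the check works for both vendors:

```python
import torch

def select_device() -> torch.device:
    # Both CUDA and ROCm builds of PyTorch report GPUs via torch.cuda
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

def to_inference_device(model: torch.nn.Module) -> torch.nn.Module:
    """Move a (TTS) model to the selected device and switch it to
    eval mode for inference."""
    return model.to(select_device()).eval()
```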
To enhance conversational capabilities, the 'history' feature was incorporated, enabling the digital assistant to maintain the context over multiple dialogue turns.
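A minimal sketch of how such a history might be assembled into an LLM request, assuming an OpenAI-style message format; `build_messages` and `max_turns` are illustrative names, not the project's actual code:

```python
def build_messages(system_prompt, history, user_input, max_turns=10):
    """Assemble the message list sent to the LLM, keeping only the
    most recent turns so the context window is not exceeded."""
    # Each turn is one user message plus one assistant message
    recent = history[-2 * max_turns:]
    return (
        [{"role": "system", "content": system_prompt}]
        + recent
        + [{"role": "user", "content": user_input}]
    )
```

Trimming to the most recent turns is one simple policy; summarizing older turns is a common alternative when longer context matters.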
3.2 API Gateway Engineering
The gateway team was tasked with developing a robust system to handle user context, manage user sessions, and streamline service flow. Key implementations included:
Context Management: Ensuring the digital assistant can remember and reference past interactions within a session.
User Management: Allowing individual user preferences and history to influence the interaction flow.
Service Flow: A logical sequence of processes that manage the lifecycle of each user request.
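As a minimal sketch, this service flow can be expressed as a single pass through hypothetical `asr`, `chat`, and `tts` callables; the names are illustrative, not the gateway's actual interfaces:

```python
def handle_request(audio, session, asr, chat, tts):
    """One pass through the gateway's service flow:
    speech in -> text -> LLM reply -> speech out."""
    text = asr(audio)                                  # ASR service
    session.append({"role": "user", "content": text})  # record context
    reply = chat(session)                              # LLM with history
    session.append({"role": "assistant", "content": reply})
    return tts(reply)                                  # TTS service
```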
To support the high-volume data traffic, we increased the websocket's byte transmission capacity, enhancing the system's ability to handle dense information packets seamlessly.
3.3 Front-End Design and Integration
Three distinct front-end interfaces were created, tailored to different user interaction platforms:
Web Interface: Designed for PC users, it includes features like user account login, voice and text input, and a visual representation of the digital persona. The input method versatility caters to a wide user base, from those who prefer typing to those who utilize voice recognition.
Dataset:
Digital conversation:
Video:
Multilingual support:
Core S3 Hardware: Configured for dedicated devices, providing seamless integration into the broader system.
KR260 Robotics: Allows for the digital assistant to be embodied in a robot, enabling physical interaction and mobility.
One of the standout features of the web interface is the ability to switch between different avatars and prompts, thus personalizing the user experience.
Modify the prompt for a travel bot as follows:
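As a hypothetical illustration of avatar/prompt switching — the project's actual prompt is not reproduced here, and all names and wording below are assumptions:

```python
# Hypothetical example prompt; the project's real travel-bot prompt
# is not reproduced in this document.
MACAU_GUIDE_PROMPT = (
    "You are a friendly Macau tourism guide. Answer in the language "
    "the visitor uses (English, Mandarin, or Cantonese), and recommend "
    "local sights, food, and transport options."
)

# Illustrative persona table; selecting a persona swaps the system prompt
PERSONAS = {
    "default": "You are a helpful assistant.",
    "macau_guide": MACAU_GUIDE_PROMPT,
}

def select_prompt(persona: str) -> str:
    # Fall back to the default persona for unknown names
    return PERSONAS.get(persona, PERSONAS["default"])
```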
4 Expanded Bill of Materials (BOM)
Hardware:
Core S3, KR260, microphones, and audio-processing units
Software:
Open-source libraries for machine learning (TensorFlow, PyTorch) and computer vision (OpenCV), plus large language models such as BERT, among others.
5 Conclusion
This documentation encapsulates the journey of creating a state-of-the-art digital assistant that pushes the boundaries of technology and accessibility. By integrating advanced multilingual capabilities, robust model management, and diverse interfacing options, we have crafted a tool that not only enhances user experiences across various platforms—including tourism, live broadcasting, and psychological support—but also sets a new standard for digital interaction. Future expansions and continuous improvements aim to further elevate its capabilities, ensuring that our digital assistant remains at the forefront of technological innovation and user inclusivity.