Speech processing applications are rarely implemented on Zynq-based devices: on one hand, the neural network models involved are typically large, and on the other hand, Vitis-AI has limited support for 1D convolution, which is crucial to many speech models. In this project, online automatic speech recognition is implemented on the Kria KV260 board. A QuartzNet speech-to-text (STT) model is deployed with ONNX Runtime on the Arm processors, and a quantized MarbleNet for voice activity detection (VAD) is deployed with Vitis-AI. The two models perform collaborative inference to transcribe speech and relieve the STT congestion caused by continuous long audio input. In addition, a graphical application is provided for friendly interaction with the KV260.
Project Purpose and Challenge
It has always been a problem that people with hearing impairment are unable to enjoy video media, and many media providers do not offer captions. The problem is even worse in live streams, where no sign language support is provided. We therefore propose an automatic speech recognition (ASR) solution that generates captions in real time to help people with hearing impairment follow video media and live streams. The target language is standard British or American English.
There is a trend of deploying ASR tools on edge devices to make them portable. For example, the Kaldi speech recognition toolkit is installed and run on a Raspberry Pi in [1]. Another project [2] focuses on the Voice2JSON ASR tool, also deployed on a Raspberry Pi. A third project [3] implements ASR applications on the Nvidia Jetson Nano AI edge computer. However, these projects usually suffer from slow inference due to the limited throughput of the CPU, or from high power consumption due to the use of a GPU.
As online ASR inference on the edge is still a challenge, the KV260 is chosen as the platform for this project to take advantage of its high parallelism and low power consumption. By deploying the STT algorithm and the preprocessing on the processing system (PS) and accelerating the VAD part on the programmable logic (PL), the computational effort is balanced, so we obtain lower inference latency while keeping power consumption at an acceptable level. In other words, the advantage of a heterogeneous multiprocessor system-on-chip (MPSoC) is exploited.
CNN Models
Two CNN models are used in this project, both converted from NeMo. Several useful scripts and modules from NeMo and Xilinx's Brevitas are adopted for easier deployment on edge devices.
- Speech-to-text model - QuartzNet
QuartzNet [4] is a lightweight end-to-end speech recognition model, a variant of Jasper that uses 1D time-channel separable convolutional layers. The 15x5 version contains only 18.9M parameters and achieves over 95% accuracy on the LibriSpeech dev-other dataset. With high throughput and accuracy, QuartzNet can provide frame-level speech-to-text inference and is suitable for edge devices with limited storage and computing capacity. The input feature for QuartzNet is the Mel spectrogram, and a sample of the feature is shown below. As QuartzNet is sensitive to floating-point input data, it is crucial to keep it as a float model, so we deploy it as an ONNX model on the PS side of the KV260.
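As a rough illustration of the PS-side deployment, the following sketch runs a QuartzNet ONNX model with ONNX Runtime and decodes it greedily; the model path, the (batch, time, classes) output layout and the CTC alphabet are assumptions for illustration, not the project's exact code.
# Sketch only: run the exported QuartzNet model with ONNX Runtime on the Arm PS.
# The model path, output layout and alphabet below are assumptions.
import numpy as np
import onnxruntime as ort

LABELS = list(" abcdefghijklmnopqrstuvwxyz'") + ["<blank>"]   # assumed CTC alphabet, blank last

session = ort.InferenceSession("quartznet.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

def transcribe(mel_spectrogram):
    """mel_spectrogram: float32 array of shape (1, n_mels, n_frames)."""
    logits = session.run(None, {input_name: mel_spectrogram.astype(np.float32)})[0]
    best = logits.argmax(axis=-1)[0]          # assumes (batch, time, classes) output
    # Greedy CTC decoding: collapse repeated symbols and drop blanks.
    text, prev = [], None
    for idx in best:
        if idx != prev and idx != len(LABELS) - 1:
            text.append(LABELS[idx])
        prev = idx
    return "".join(text)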
- Voice activity detection model - MarbleNet
MarbleNet [5] is a lightweight end-to-end CNN model; the smallest 3x2x64 version contains only 88k parameters and provides frame-level voice activity prediction. The model is trained on the Google Speech Commands v2 (speech) and Freesound (background) datasets and achieves 94.1% accuracy on the test sets. The input feature for MarbleNet is the Mel-frequency cepstral coefficients (MFCC), i.e. the DCT of the Mel log powers; an example of MFCC is shown below. Since MarbleNet is not very deep and uses simple operators, we deploy it as a quantized model on the KV260 and accelerate its inference with the Deep Learning Processor Unit (DPU).
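For reference, a minimal sketch of MFCC extraction with torchaudio is shown below; the window, hop and coefficient settings are assumptions, and the real parameters come from the preprocessing configuration used by the project.
# Sketch only: extract MFCC features for MarbleNet with torchaudio.
# The parameter values are assumptions; the project's preprocessing defines the real ones.
import torchaudio

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=64,                                                   # assumed number of coefficients
    melkwargs={"n_fft": 512, "win_length": 400, "hop_length": 160, "n_mels": 64},
)

waveform, sr = torchaudio.load("audio/sample.wav")               # hypothetical recording
assert sr == 16000, "MarbleNet expects 16 kHz audio"
features = mfcc_transform(waveform)                              # shape: (channels, n_mfcc, n_frames)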
- Adapting the model for the DPU
Since the Vitis AI Compiler accepts only a limited set of operators and the VART programming APIs have better support for Class 1 and 2 inference, the model has to be adjusted and partitioned for convenient development.
- Converting 1D layers to 2D layers
The DPUCZDX8G for the KV260 does not accept 1D convolutional layers when compiling a PyTorch model. We solve this by converting the 1D layers to 2D layers: a 2D convolutional layer is equivalent to a 1D layer when the extra dimension of kernel_size is 1, and the same holds for stride. If padding is applied, the extra padding value should be 0, while all other layer parameters remain the same.
For instance, a 1D convolutional layer is defined as follows:
self.n_Conv_0 = nn.Conv1d(**{'groups': 64, 'dilation': [1], 'out_channels': 64, 'padding': [16], 'kernel_size': (33,), 'stride': [2], 'in_channels': 64, 'bias': False})
By changing the parameters discussed above, it can be converted into a 2D convolutional layer that produces the same output.
self.n_Conv_0 = nn.Conv2d(**{'groups': 64, 'dilation': 1, 'out_channels': 64, 'padding': (16, 0), 'kernel_size': (33,1), 'stride': (2,1), 'in_channels': 64, 'bias': False})
To load the pre-trained weights into the new 2D layer, the weight tensor has to be expanded with an extra trailing dimension.
# Original: load the 1D weights directly
#v = torch.from_numpy(np.load(b))
# Converted: add a trailing dimension so the weights match the (33, 1) kernel
v = torch.from_numpy(np.expand_dims(np.load(b),-1))
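The equivalence can be verified numerically; the sketch below uses the layer parameters from the example above and a random input to check that the converted 2D layer reproduces the original 1D output.
# Quick check that the converted Conv2d reproduces the original Conv1d output.
import torch
import torch.nn as nn

conv1d = nn.Conv1d(64, 64, kernel_size=33, stride=2, padding=16, groups=64, bias=False)
conv2d = nn.Conv2d(64, 64, kernel_size=(33, 1), stride=(2, 1), padding=(16, 0),
                   groups=64, bias=False)
conv2d.weight.data = conv1d.weight.data.unsqueeze(-1)   # (64, 1, 33) -> (64, 1, 33, 1)

x = torch.randn(1, 64, 1000)                            # (batch, channels, time)
y1 = conv1d(x)
y2 = conv2d(x.unsqueeze(-1)).squeeze(-1)                # add/remove the dummy width dimension
print(torch.allclose(y1, y2, atol=1e-6))                # expected: True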
- Model Partition
From the architecture of MarbleNet shown below, we can identify which parts of the model are compilable and which are not. The Prologue and Block 1 to Block B meet the requirements of the Vitis AI Compiler. The Epilogue cannot be compiled because its kernel width is too large. Since the Cross-Entropy part uses many custom operators, it is not considered for the DPU subgraph either. Although the two layers between the Cross-Entropy part and the Epilogue could be compiled, the additional DPU subgraph they would create is not efficient for the whole system: it increases data transmission between the PS and PL, and these layers are cheap to compute on the CPU anyway. The final partition scheme therefore splits the model at the end of Block B, so that ideally there is only one DPU subgraph.
- Quantizing and compiling the model
Scripts for model compilation are provided in the project repository. With the Vitis-AI 1.4 Docker image and XIR, deployable model files for the DPU can be built from the PyTorch model with the following commands.
# Under the Vitis-AI 1.4 docker
conda activate vitis-ai-pytorch
python3 compile.py
vai_c_xir -x ./quantize_result/Model_U_int.xmodel -a /opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json -n vad_model_u
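For reference, the quantization step inside compile.py roughly follows the standard Vitis AI PyTorch quantizer flow; the sketch below is an assumption of what that step looks like, with the model class, weight path and input shape used only as placeholders.
# Sketch only: assumed outline of the quantization step in compile.py using the
# Vitis AI PyTorch quantizer. Model class, weight path and input shape are placeholders.
import torch
from pytorch_nndct.apis import torch_quantizer
from marblenet import Model_U                         # hypothetical: the partitioned MarbleNet block

model = Model_U()
model.load_state_dict(torch.load("variables/weights.pth"))   # hypothetical weight file
model.eval()

dummy_input = torch.randn(1, 64, 1000, 1)             # assumed MFCC shape after the 1D->2D conversion

for mode in ("calib", "test"):
    quantizer = torch_quantizer(mode, model, (dummy_input,))
    quant_model = quantizer.quant_model
    quant_model(dummy_input)                          # real calibration feeds recorded audio features
    if mode == "calib":
        quantizer.export_quant_config()               # writes the calibration results
    else:
        quantizer.export_xmodel(deploy_check=False)   # writes quantize_result/Model_U_int.xmodel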
As the compilation result shows, with the model partition there are 3 subgraphs in total and only 1 DPU subgraph, which makes the model easy to deploy with the Vitis-AI libraries. With the xdputil tool, we can also render an image of the compiled model to verify that the partition result and data shapes are correct.
xdputil xmodel vad_model_u.xmodel -s output.svg
- Flow control
A flow control program manages the buffer of recorded audio and communicates with the backend program over a socket. When new audio is recorded, the audio file name is sent to the inference queue. When a result comes back to the flow control program, the outdated audio files are deleted and the next audio file name is sent to the queue. This gives us a larger buffer for inference and protects against congestion; a simplified sketch of the idea follows.
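The sketch below illustrates the flow control idea in simplified form; the port, message format and callback names are assumptions, and the real implementation is in the project's flow control code.
# Simplified sketch of the flow control idea; port, message format and names are assumptions.
import os
import socket
from collections import deque

BACKEND_ADDR = ("127.0.0.1", 50007)        # assumed address of the inference back end
pending = deque()                          # audio files waiting for transcription

def on_new_recording(sock, wav_path):
    """Recorder callback: queue the new chunk and notify the inference back end."""
    pending.append(wav_path)
    sock.sendall((wav_path + "\n").encode())

def on_result(sock, transcript):
    """Back-end callback: display the text and delete the audio that produced it."""
    finished = pending.popleft()
    os.remove(finished)                    # free the buffer once the chunk is transcribed
    print(transcript)

# The real flow control program connects once, e.g. sock = socket.create_connection(BACKEND_ADDR),
# and wires these callbacks into the recorder/GUI event loop.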
- Application
To capture the voice signal, a USB microphone that supports a 16 kHz sample rate is needed. The recorder saves the received PCM signal as WAV files and converts them to the PAM signal when the files are requested by the flow control block.
As discussed before, VAD is used to relieve congestion in the speech-to-text stage and to prevent latency from accumulating when audio arrives faster than the system can process it. After the features are obtained in preprocessing, the VAD algorithm detects voice activity very quickly through the DPU and the Vitis-AI Runtime and decides whether STT should be run; a sketch of this gate is shown below.
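The following sketch is based on the DPU-PYNQ runner API and shows how the compiled VAD subgraph can be executed on the PL; the overlay and model file names follow the compilation step above, while the buffer and shape handling mirrors the DPU-PYNQ examples rather than the project's exact code.
# Sketch only: run the compiled MarbleNet DPU subgraph with the DPU-PYNQ runner.
# Buffer handling follows the DPU-PYNQ examples; tensor shapes are read from the model.
import numpy as np
from pynq_dpu import DpuOverlay

overlay = DpuOverlay("dpu.bit")
overlay.load_model("vad_model_u.xmodel")
dpu = overlay.runner

in_dims = tuple(dpu.get_input_tensors()[0].dims)
out_dims = tuple(dpu.get_output_tensors()[0].dims)

def vad_forward(mfcc_chunk):
    """Run the DPU subgraph (Prologue and Blocks) of the quantized MarbleNet on the PL."""
    input_data = [np.empty(in_dims, dtype=np.float32, order="C")]
    output_data = [np.empty(out_dims, dtype=np.float32, order="C")]
    input_data[0][...] = mfcc_chunk.reshape(in_dims)
    job_id = dpu.execute_async(input_data, output_data)
    dpu.wait(job_id)
    return output_data[0]

# The DPU output is then passed through the CPU part of MarbleNet (Epilogue and final layers)
# to obtain speech/non-speech scores; only when speech is detected is QuartzNet run on the PS.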
To conveniently record and process audio from the outside world, a Qt GUI was built. Even without a graphical desktop, users can run the graphical application through the X Window System. The GUI contains a text box that shows the result and a button that starts and stops the microphone recorder, and a socket separates the front end from the back end; a minimal sketch follows.
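As an illustration of the front end, the minimal sketch below shows a Qt window with a read-only text box and a start/stop button that talks to a back end over a socket; the widget layout, socket address and message format are assumptions, and the real GUI lives in the app folder.
# Sketch only: minimal Qt front end with one text box and one start/stop button.
# The socket address and message format are assumptions.
import socket
import sys
from PyQt5.QtWidgets import QApplication, QPushButton, QTextEdit, QVBoxLayout, QWidget

class AsrViewer(QWidget):
    def __init__(self, backend_addr=("127.0.0.1", 50007)):
        super().__init__()
        self.setWindowTitle("ASR Viewer")
        self.text = QTextEdit()                 # shows the transcribed result
        self.text.setReadOnly(True)
        self.button = QPushButton("Start")
        self.button.clicked.connect(self.toggle)
        layout = QVBoxLayout(self)
        layout.addWidget(self.text)
        layout.addWidget(self.button)
        self.sock = socket.create_connection(backend_addr)   # back end must be running
        self.recording = False

    def toggle(self):
        # Tell the recorder back end to start or stop capturing audio.
        self.recording = not self.recording
        self.sock.sendall(b"start\n" if self.recording else b"stop\n")
        self.button.setText("Stop" if self.recording else "Start")

if __name__ == "__main__":
    app = QApplication(sys.argv)
    viewer = AsrViewer()
    viewer.show()
    sys.exit(app.exec_())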
To reproduce this project, the hardware and software requirements are listed in the Complete Bill of Materials. Please follow the guide below.
- Setting up the edge device
This project is based on the DPU-PYNQ environment, so first follow the Getting Started Tutorial to install Ubuntu on the KV260 board.
To install Kria-PYNQ on Ubuntu, run the following commands (this takes about 20-30 minutes):
# Edge
git clone https://github.com/Xilinx/Kria-PYNQ.git
cd Kria-PYNQ/
sudo bash install.sh
Then, clone the project repository to your devices.
# Edge
cd /home/ubuntu/
git clone https://github.com/youjunl/fpga-asr.git
cd fpga-asr
Then install the dependencies as root.
# Edge
su
pip3 install -r requirement_a.txt
pip3 install torchaudio==0.8.0 -f https://torch.kmtea.eu/whl/stable.html
pip3 install pynq_dpu
apt-get install python3-pyqt5 python3-pyaudio
Finally, plug a USB microphone into the board, and the project setup is finished.
- Setting up X11 for root
Setting up X11 is important for users who have neither a graphical desktop installed on the board nor an HDMI screen to display the Qt application. Since X11 forwarding and authentication are integrated into most operating systems, no extra installation is needed.
- Enable root login for KV260
If you can log in via SSH as root with the -X option, X11 is set up automatically. To allow this, add PermitRootLogin yes to the file /etc/ssh/sshd_config and restart the SSH service.
# Edge
sudo chmod 777 /etc/ssh/sshd_config
vi /etc/ssh/sshd_config
# Restart the SSH service after editing
sudo systemctl restart ssh
Now you can try SSH with the host computer.
# Host
ssh -X 192.168.1.108 -l root
- (If the method above does not work) Setting X11 for each login process
Reconnect to the board with X11 forwarding enabled.
# Host
ssh -X 192.168.1.108 -l ubuntu
Check the X authority key under the user ubuntu.
ubuntu@kria:~$ xauth list
kria/unix:10 MIT-MAGIC-COOKIE-1 5a02fa6e7d6910124e8dcb470124f49e
Add the same key under root.
root@kria:/home/ubuntu# xauth add kria/unix:10 MIT-MAGIC-COOKIE-1 5a02fa6e7d6910124e8dcb470124f49e
Now, X11 forwarding is enabled.
- Running the program
In this guide, X11 is used to run the application; optionally, you can run it with a graphical desktop and an HDMI monitor. First, connect to the KV260 (change the IP address if your board uses a different one).
# Host
ssh -X 192.168.1.108 -l ubuntu
Since the DPU on PYNQ needs root permissions, please make sure you run the demo as root with the following commands.
# Edge
su
cd /home/ubuntu/fpga-asr
bash run.bash
You will then see the ASR Viewer on your remote desktop, and the button becomes available once the program finishes initialization. Press the button and the program starts listening; say something into the microphone and the transcribed result will appear in the window.
In this project, we deploy models on both the DPU and the CPU and optimize the ASR pipeline. This is a solution for relieving congestion in front of the model when the system receives long sequence input. Instead of deploying the speech-to-text model on the PL, we chose the smaller VAD model, which is easier to compile and deploy because it has fewer subgraphs and a simpler architecture. Finally, we tested the application on the KV260: although there is still some latency during transcription, the application does perform online speech-to-text prediction and recognizes most of the speech.
Repository Instructions
- run.bash: Script to run the application. It runs asr.py and call_work.py at the same time.
- asr.py: Code for the ASR process.
- call_work.py: Code for the GUI and recorder process.
- frame.py: Code for online inference.
- frame_cpu.py: Code for online inference with CPU only, used for testing on a PC.
- marblenet.py: Code for the MarbleNet model, used in model compiling and postprocessing.
- quartznet.py: Code for the QuartzNet model, used to export the ONNX model.
- utils: Code for running the neural networks.
- variables: MarbleNet variables used for compiling.
- app: Code for the recorder and Qt GUI.
- backup: Reference model backup.
- compile: Scripts and calibration data for compiling MarbleNet.
- audio: Audio backup folder.
[1] "Kaldi-on-RaspberryPi2," Aug. 4, 2017. Accessed: Mar. 28, 2022. [Online]. Available: https://github.com/saeidmokaram/Kaldi-on-RaspberryPi2
[2] "Python Edge Speech Recognition with Voice2JSON," Feb. 17, 2022. Accessed: Mar. 28, 2022. [Online]. Available: https://learn.adafruit.com/edge-speech-recognition-with-voice2json/configuring-custom-commands
[3] "jetson-voice," Mar. 10, 2022. Accessed: Mar. 28, 2022. [Online]. Available: https://github.com/dusty-nv/jetson-voice
[4] S. Kriman, S. Beliaev, B. Ginsburg, et al., "QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions," in Proc. IEEE ICASSP, 2020, pp. 6124-6128.
[5] F. Jia, S. Majumdar, and B. Ginsburg, "MarbleNet: Deep 1D time-channel separable convolutional neural network for voice activity detection," in Proc. IEEE ICASSP, 2021, pp. 6818-6822.