Speech processing applications are rarely implemented on Zynq-based devices: on one hand, the neural network models involved are typically large, and on the other hand, Vitis-AI has limited support for 1D convolution, which is crucial to many speech models. In this project, online automatic speech recognition is implemented on the Kria KV260 board. A QuartzNet speech-to-text (STT) model is deployed with ONNX Runtime on the Arm processors, and a quantized MarbleNet for voice activity detection (VAD) is deployed with Vitis-AI. The two models perform collaborative inference to transcribe speech and relieve the STT congestion caused by continuous long audio input. In addition, a graphical application is provided for friendly interaction with the KV260.
Project Purpose and Challenge
It has always been a problem that people with hearing impairment are unable to enjoy video media, and many media providers do not offer captions. The problem is even worse in live streams, where no sign language support is provided. We therefore propose an automatic speech recognition (ASR) solution that generates captions in real time to help people with hearing impairment follow video media and live streams. The target language is standard British or American English.
There is a trend of deploying ASR tools on edge devices to make them portable. For example, the Kaldi speech recognition toolkit is installed and run on a Raspberry Pi in [1]. Another project [2] focuses on the Voice2JSON ASR tool, also deployed on a Raspberry Pi. A third project [3] implements ASR applications on the Nvidia Jetson Nano AI edge computer. However, these projects usually suffer from slow inference due to the limited throughput of the CPU, or from high power consumption due to the use of a GPU.
As online ASR inference on the edge is still a challenge, the KV260 is chosen as the platform for this project to take advantage of its high parallelism and low power consumption. By deploying the STT algorithm and the preprocessing on the processing system (PS) and accelerating the VAD part on the programmable logic (PL), the computational effort is balanced, so we obtain lower inference latency while keeping power consumption at an acceptable level. In other words, the advantage of a heterogeneous multiprocessor system-on-chip (MPSoC) is exploited.
CNN Models
Two CNN models are used in this project, both converted from NeMo. Several useful scripts and modules from NeMo and Xilinx's Brevitas are adopted for easier deployment on edge devices.
- Speech-to-text model - QuartzNet
QuartzNet [4] is a lightweight end-to-end speech recognition model, a variant of Jasper that uses 1D time-channel separable convolutional layers. The 15x5 version contains only 18.9M parameters and achieves over 95% accuracy on the LibriSpeech dev-other dataset. With high throughput and accuracy, QuartzNet can provide frame-level speech-to-text inference and is suitable for edge devices with limited storage and computing capacity. The input feature for QuartzNet is the Mel spectrogram, and a sample of the feature is shown below. As QuartzNet is sensitive to floating-point input data, it is crucial to keep it as a float model, so we deploy it as an ONNX model on the PS side of the KV260.
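As a rough illustration of the PS-side deployment, the following sketch runs a QuartzNet ONNX model with ONNX Runtime and decodes it greedily; the model path, the (batch, time, classes) output layout and the CTC alphabet are assumptions for illustration, not the project's exact code.
# Sketch only: run the exported QuartzNet model with ONNX Runtime on the Arm PS.
# The model path, output layout and alphabet below are assumptions.
import numpy as np
import onnxruntime as ort

LABELS = list(" abcdefghijklmnopqrstuvwxyz'") + ["<blank>"]   # assumed CTC alphabet, blank last

session = ort.InferenceSession("quartznet.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

def transcribe(mel_spectrogram):
    """mel_spectrogram: float32 array of shape (1, n_mels, n_frames)."""
    logits = session.run(None, {input_name: mel_spectrogram.astype(np.float32)})[0]
    best = logits.argmax(axis=-1)[0]          # assumes (batch, time, classes) output
    # Greedy CTC decoding: collapse repeated symbols and drop blanks.
    text, prev = [], None
    for idx in best:
        if idx != prev and idx != len(LABELS) - 1:
            text.append(LABELS[idx])
        prev = idx
    return "".join(text)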
- Voice activity detection model - MarbleNet
MarbleNet [5] is a lightweight end-to-end CNN model; the smallest 3x2x64 version contains only 88k parameters and provides frame-level voice activity prediction. The model is trained on the Google Speech Commands v2 (speech) and Freesound (background) datasets and achieves 94.1% accuracy on the test sets. The input feature for MarbleNet is the Mel-frequency cepstral coefficients (MFCC), i.e. the DCT of the Mel log powers; an example of MFCC is shown below. Since MarbleNet is not very deep and uses simple operators, we deploy it as a quantized model on the KV260 and accelerate its inference with the Deep Learning Processor Unit (DPU).
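For reference, a minimal sketch of MFCC extraction with torchaudio is shown below; the window, hop and coefficient settings are assumptions, and the real parameters come from the preprocessing configuration used by the project.
# Sketch only: extract MFCC features for MarbleNet with torchaudio.
# The parameter values are assumptions; the project's preprocessing defines the real ones.
import torchaudio

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=64,                                                   # assumed number of coefficients
    melkwargs={"n_fft": 512, "win_length": 400, "hop_length": 160, "n_mels": 64},
)

waveform, sr = torchaudio.load("audio/sample.wav")               # hypothetical recording
assert sr == 16000, "MarbleNet expects 16 kHz audio"
features = mfcc_transform(waveform)                              # shape: (channels, n_mfcc, n_frames)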
- Adapting the model for the DPU
Since the Vitis AI Compiler accepts only a limited set of operators and the VART programming APIs have better support for Class 1 and 2 inference, the model has to be adjusted and partitioned for convenient development.
- Converting 1D layers to 2D layers
The DPUCZDX8G for the KV260 does not accept 1D convolutional layers when compiling a PyTorch model. We solve this by converting the 1D layers to 2D layers: a 2D convolutional layer is equivalent to a 1D layer when the extra dimension of kernel_size is 1, and the same holds for stride. If padding is applied, the extra padding value should be 0, while all other layer parameters remain the same.
For instance, a 1D convolutional layer is defined as follows:
self.n_Conv_0 = nn.Conv1d(**{'groups': 64, 'dilation': [1], 'out_channels': 64, 'padding': [16], 'kernel_size': (33,), 'stride': [2], 'in_channels': 64, 'bias': False})
By changing the parameters discussed above, it can be converted into a 2D convolutional layer that produces the same output.
self.n_Conv_0 = nn.Conv2d(**{'groups': 64, 'dilation': 1, 'out_channels': 64, 'padding': (16, 0), 'kernel_size': (33,1), 'stride': (2,1), 'in_channels': 64, 'bias': False})
To load the pre-trained weights into the new 2D layer, the weight tensor has to be expanded with an extra trailing dimension.
# Original: load the 1D weights directly
#v = torch.from_numpy(np.load(b))
# Converted: add a trailing dimension so the weights match the (33, 1) kernel
v = torch.from_numpy(np.expand_dims(np.load(b),-1))
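The equivalence can be verified numerically; the sketch below uses the layer parameters from the example above and a random input to check that the converted 2D layer reproduces the original 1D output.
# Quick check that the converted Conv2d reproduces the original Conv1d output.
import torch
import torch.nn as nn

conv1d = nn.Conv1d(64, 64, kernel_size=33, stride=2, padding=16, groups=64, bias=False)
conv2d = nn.Conv2d(64, 64, kernel_size=(33, 1), stride=(2, 1), padding=(16, 0),
                   groups=64, bias=False)
conv2d.weight.data = conv1d.weight.data.unsqueeze(-1)   # (64, 1, 33) -> (64, 1, 33, 1)

x = torch.randn(1, 64, 1000)                            # (batch, channels, time)
y1 = conv1d(x)
y2 = conv2d(x.unsqueeze(-1)).squeeze(-1)                # add/remove the dummy width dimension
print(torch.allclose(y1, y2, atol=1e-6))                # expected: True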
- Model Partition
From the architecture of MarbleNet shown below, we can identify which parts of the model are compilable and which are not. The Prologue and Block 1 to Block B meet the requirements of the Vitis AI Compiler. The Epilogue cannot be compiled because its kernel width is too large. Since the Cross-Entropy part uses many custom operators, it is not considered for the DPU subgraph either. Although the two layers between the Cross-Entropy part and the Epilogue could be compiled, the additional DPU subgraph they would create is not efficient for the whole system: it increases data transmission between the PS and PL, and these layers are cheap to compute on the CPU anyway. The final partition scheme therefore splits the model at the end of Block B, so that ideally there is only one DPU subgraph.
- Quantizing and compiling the model
Scripts for model compilation are provided in the project repository. With the Vitis-AI 1.4 Docker image and XIR, deployable model files for the DPU can be built from the PyTorch model with the following commands.
# Under the Vitis-AI 1.4 docker
conda activate vitis-ai-pytorch
python3 compile.py
vai_c_xir -x ./quantize_result/Model_U_int.xmodel -a /opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json -n vad_model_u
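For reference, the quantization step inside compile.py roughly follows the standard Vitis AI PyTorch quantizer flow; the sketch below is an assumption of what that step looks like, with the model class, weight path and input shape used only as placeholders.
# Sketch only: assumed outline of the quantization step in compile.py using the
# Vitis AI PyTorch quantizer. Model class, weight path and input shape are placeholders.
import torch
from pytorch_nndct.apis import torch_quantizer
from marblenet import Model_U                         # hypothetical: the partitioned MarbleNet block

model = Model_U()
model.load_state_dict(torch.load("variables/weights.pth"))   # hypothetical weight file
model.eval()

dummy_input = torch.randn(1, 64, 1000, 1)             # assumed MFCC shape after the 1D->2D conversion

for mode in ("calib", "test"):
    quantizer = torch_quantizer(mode, model, (dummy_input,))
    quant_model = quantizer.quant_model
    quant_model(dummy_input)                          # real calibration feeds recorded audio features
    if mode == "calib":
        quantizer.export_quant_config()               # writes the calibration results
    else:
        quantizer.export_xmodel(deploy_check=False)   # writes quantize_result/Model_U_int.xmodel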
As the compilation result shows, with the model partition there are 3 subgraphs in total and only 1 DPU subgraph, which makes the model easy to deploy with the Vitis-AI libraries. With the xdputil tool, we can also render an image of the compiled model to verify that the partition result and data shapes are correct.
xdputil xmodel vad_model_u.xmodel -s output.svg
- Flow control
A flow control program manages the buffer of recorded audio and communicates with the backend program over a socket. When new audio is recorded, the audio file name is sent to the inference queue. When a result comes back to the flow control program, the outdated audio files are deleted and the next audio file name is sent to the queue. This gives us a larger buffer for inference and protects against congestion; a simplified sketch of the idea follows.
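The sketch below illustrates the flow control idea in simplified form; the port, message format and callback names are assumptions, and the real implementation is in the project's flow control code.
# Simplified sketch of the flow control idea; port, message format and names are assumptions.
import os
import socket
from collections import deque

BACKEND_ADDR = ("127.0.0.1", 50007)        # assumed address of the inference back end
pending = deque()                          # audio files waiting for transcription

def on_new_recording(sock, wav_path):
    """Recorder callback: queue the new chunk and notify the inference back end."""
    pending.append(wav_path)
    sock.sendall((wav_path + "\n").encode())

def on_result(sock, transcript):
    """Back-end callback: display the text and delete the audio that produced it."""
    finished = pending.popleft()
    os.remove(finished)                    # free the buffer once the chunk is transcribed
    print(transcript)

# The real flow control program connects once, e.g. sock = socket.create_connection(BACKEND_ADDR),
# and wires these callbacks into the recorder/GUI event loop.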
- Application
To capture the voice signal, a USB microphone that supports a 16 kHz sample rate is needed. The recorder saves the received PCM signal as WAV files and converts them to the PAM signal when the files are requested by the flow control block.
As discussed before, VAD is used to relieve congestion in the speech-to-text stage and to prevent latency from accumulating when audio arrives faster than the system can process it. After the features are obtained in preprocessing, the VAD algorithm detects voice activity very quickly through the DPU and the Vitis-AI Runtime and decides whether STT should be run; a sketch of this gate is shown below.
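The following sketch is based on the DPU-PYNQ runner API and shows how the compiled VAD subgraph can be executed on the PL; the overlay and model file names follow the compilation step above, while the buffer and shape handling mirrors the DPU-PYNQ examples rather than the project's exact code.
# Sketch only: run the compiled MarbleNet DPU subgraph with the DPU-PYNQ runner.
# Buffer handling follows the DPU-PYNQ examples; tensor shapes are read from the model.
import numpy as np
from pynq_dpu import DpuOverlay

overlay = DpuOverlay("dpu.bit")
overlay.load_model("vad_model_u.xmodel")
dpu = overlay.runner

in_dims = tuple(dpu.get_input_tensors()[0].dims)
out_dims = tuple(dpu.get_output_tensors()[0].dims)

def vad_forward(mfcc_chunk):
    """Run the DPU subgraph (Prologue and Blocks) of the quantized MarbleNet on the PL."""
    input_data = [np.empty(in_dims, dtype=np.float32, order="C")]
    output_data = [np.empty(out_dims, dtype=np.float32, order="C")]
    input_data[0][...] = mfcc_chunk.reshape(in_dims)
    job_id = dpu.execute_async(input_data, output_data)
    dpu.wait(job_id)
    return output_data[0]

# The DPU output is then passed through the CPU part of MarbleNet (Epilogue and final layers)
# to obtain speech/non-speech scores; only when speech is detected is QuartzNet run on the PS.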
To conveniently record and process audio from the outside world, a Qt GUI was built. Even without a graphical desktop, users can run the graphical application through the X Window System. The GUI contains a text box that shows the result and a button that starts and stops the microphone recorder, and a socket separates the front end from the back end; a minimal sketch follows.
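As an illustration of the front end, the minimal sketch below shows a Qt window with a read-only text box and a start/stop button that talks to a back end over a socket; the widget layout, socket address and message format are assumptions, and the real GUI lives in the app folder.
# Sketch only: minimal Qt front end with one text box and one start/stop button.
# The socket address and message format are assumptions.
import socket
import sys
from PyQt5.QtWidgets import QApplication, QPushButton, QTextEdit, QVBoxLayout, QWidget

class AsrViewer(QWidget):
    def __init__(self, backend_addr=("127.0.0.1", 50007)):
        super().__init__()
        self.setWindowTitle("ASR Viewer")
        self.text = QTextEdit()                 # shows the transcribed result
        self.text.setReadOnly(True)
        self.button = QPushButton("Start")
        self.button.clicked.connect(self.toggle)
        layout = QVBoxLayout(self)
        layout.addWidget(self.text)
        layout.addWidget(self.button)
        self.sock = socket.create_connection(backend_addr)   # back end must be running
        self.recording = False

    def toggle(self):
        # Tell the recorder back end to start or stop capturing audio.
        self.recording = not self.recording
        self.sock.sendall(b"start\n" if self.recording else b"stop\n")
        self.button.setText("Stop" if self.recording else "Start")

if __name__ == "__main__":
    app = QApplication(sys.argv)
    viewer = AsrViewer()
    viewer.show()
    sys.exit(app.exec_())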
To reproduce this project, the hardware and software requirements are listed in the Complete Bill of Materials. Please follow the guide below.
- Setting up the edge device
This project is based on the DPU-PYNQ environment, so first follow the Getting Started Tutorial to install Ubuntu on the KV260 board.
To install Kria-PYNQ on Ubuntu, run the following commands (this takes about 20-30 minutes):
# Edge
git clone https://github.com/Xilinx/Kria-PYNQ.git
cd Kria-PYNQ/
sudo bash install.sh
Then, clone the project repository to your devices.
# Edge
cd /home/ubuntu/
git clone https://github.com/youjunl/fpga-asr.git
cd fpga-asr
Then install the dependencies as root.
# Edge
su
pip3 install -r requirement_a.txt
pip3 install torchaudio==0.8.0 -f https://torch.kmtea.eu/whl/stable.html
pip3 install pynq_dpu
apt-get install python3-pyqt5 python3-pyaudio
Finally, plug a USB microphone into the board, and the project setup is finished.
- Setting up X11 for root
Setting up X11 is important for users who have neither a graphical desktop installed on the board nor an HDMI screen to display the Qt application. Since X11 forwarding and authentication are integrated into most operating systems, no extra installation is needed.
- Enable root login for KV260
If you can log in via SSH as root with the -X option, X11 is set up automatically. To allow this, add PermitRootLogin yes to the file /etc/ssh/sshd_config and restart the SSH service.
# Edge
sudo chmod 777 /etc/ssh/sshd_config
vi /etc/ssh/sshd_config
# Restart the SSH service after editing
sudo systemctl restart ssh
Now you can try SSH with the host computer.
# Host
ssh -X 192.168.1.108 -l root
- (If the method above does not work) Setting X11 for each login process
Reconnect to the board with X11 forwarding enabled.
# Host
ssh -X 192.168.1.108 -l ubuntu
Check the X authority key under the user ubuntu.
ubuntu@kria:~$ xauth list
kria/unix:10 MIT-MAGIC-COOKIE-1 5a02fa6e7d6910124e8dcb470124f49e
Add the same key under root.
root@kria:/home/ubuntu# xauth add kria/unix:10 MIT-MAGIC-COOKIE-1 5a02fa6e7d6910124e8dcb470124f49e
Now, X11 forwarding is enabled.
- Running the program
In this guide, X11 is used to run the application; optionally, you can run it with a graphical desktop and an HDMI monitor. First, connect to the KV260 (change the IP address if your board uses a different one).
# Host
ssh -X 192.168.1.108 -l ubuntu
Since the DPU on PYNQ needs root permissions, please make sure you run the demo as root with the following commands.
# Edge
su
cd /home/ubuntu/fpga-asr
bash run.bash
You will then see the ASR Viewer on your remote desktop, and the button becomes available once the program finishes initialization. Press the button and the program starts listening; say something into the microphone and the transcribed result will appear in the window.
In this project, we deploy models on both the DPU and the CPU and optimize the ASR pipeline. This is a solution for relieving congestion in front of the model when the system receives long sequence input. Instead of deploying the speech-to-text model on the PL, we chose the smaller VAD model, which is easier to compile and deploy because it has fewer subgraphs and a simpler architecture. Finally, we tested the application on the KV260: although there is still some latency during transcription, the application does perform online speech-to-text prediction and recognizes most of the speech.
Repository Instructions
- run.bash: Script to run the application. It runs asr.py and call_work.py at the same time.
- asr.py: Code for the ASR process.
- call_work.py: Code for the GUI and recorder process.
- frame.py: Code for online inference.
- frame_cpu.py: Code for online inference with CPU only, used for testing on a PC.
- marblenet.py: Code for the MarbleNet model, used in model compiling and postprocessing.
- quartznet.py: Code for the QuartzNet model, used to export the ONNX model.
- utils: Code for running the neural networks.
- variables: MarbleNet variables used for compiling.
- app: Code for the recorder and Qt GUI.
- backup: Reference model backup.
- compile: Scripts and calibration data for compiling MarbleNet.
- audio: Audio backup folder.
[1] "Kaldi-on-RaspberryPi2," Aug. 4, 2017. Accessed: Mar. 28, 2022. [Online]. Available: https://github.com/saeidmokaram/Kaldi-on-RaspberryPi2
[2] "Python Edge Speech Recognition with Voice2JSON," Feb. 17, 2022. Accessed: Mar. 28, 2022. [Online]. Available: https://learn.adafruit.com/edge-speech-recognition-with-voice2json/configuring-custom-commands
[3] "jetson-voice," Mar. 10, 2022. Accessed: Mar. 28, 2022. [Online]. Available: https://github.com/dusty-nv/jetson-voice
[4] S. Kriman, S. Beliaev, B. Ginsburg, et al., "QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions," in Proc. IEEE ICASSP, 2020, pp. 6124-6128.
[5] F. Jia, S. Majumdar, and B. Ginsburg, "MarbleNet: Deep 1D time-channel separable convolutional neural network for voice activity detection," in Proc. IEEE ICASSP, 2021, pp. 6818-6822.