Today's homes have increasingly many smart devices, from TVs, to lights, to window blinds, and beyond. All of these devices typically have their own individual apps to control them, and it quickly becomes frustrating to hunt down the right app to control each device in your home.
Controlling these devices with a virtual assistant such as Alexa or Google Home can work, but it requires the user to be within speaking range of a hub or to pull out their phone. Voice assistants can also feel unnatural because of how verbose and precise some commands must be. For example, if you have multiple smart lights in your home, you have to specify "Turn off the overhead light in the living room".
Wouldn't it be great if our smart home could tell which room we are in and which device we are looking at, and let us control that device with a wave of a hand? Why ask Alexa to turn on the lights when you can simply point at them instead? Adjust the volume of the TV by turning an imaginary knob, and change the channel by swiping the air in front of you!
Smart home security cameras are becoming increasingly common, but the video that these cameras generate is mostly unused, only being checked occasionally. What if we were to process this data we are already generating, and use it to control IoT devices?
Fortunately, with powerful new edge computing devices, such as the Kria KV260 from Xilinx, applications like this are becoming more and more feasible.
In this project, we will look at how to use the Xilinx KV260 to process incoming video, detect and classify hand gestures in real time, and use those predictions to control a smart home device.
Thank you to Xilinx for providing the KV260 Starter Kit as part of the Xilinx Adaptive Computing Challenge.
Outline
In this tutorial, we will look at how to quantize and compile a pretrained PyTorch model to run on the Xilinx Kria KV260 SOM using the Vitis-AI 1.4 tools. We will then look at how to set up our KV260 with Ubuntu and install the PYNQ DPU Overlay. This will allow us to run our compiled model on the KV260 with Python code.
Finally, I will use the model to detect some hand gestures from a USB webcam and then, based on the output of the model, send some commands over my local WiFi network to my FireTV stick, to allow me to navigate the menus using hand gestures.
All the code for this project, as well as additional tips, tricks, and troubleshooting, can be found on the GitHub page for this project.
This tutorial assumes you have familiarity with the following concepts:
- Python, PyTorch, Virtual Environments / Conda
- Basic ML concepts and terminology
- Ubuntu / Linux / Windows Subsystem for Linux 2 (WSL2)
- SSH (X11 forwarding, optional)
- Docker (minimal)
For this tutorial, we will assume that we already have a pretrained PyTorch model with its weights saved to a .pt / .pth file.
Note: We will be using Vitis-AI 1.4, which runs an older version of PyTorch. You will need to make sure your model is compatible with torch==1.4.0, which may require saving your .pt file with _use_new_zipfile_serialization=False if you trained your model with a newer version of PyTorch.
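As a minimal sketch, re-saving the weights in the legacy (non-zipfile) format from a newer PyTorch install might look like the following. The nn.Linear stand-in and the gesture_net.pt filename are just placeholders for your own model and path:

import torch
import torch.nn as nn

model = nn.Linear(63, 6)  # stand-in for your trained model; load your real weights here
torch.save(model.state_dict(), 'gesture_net.pt',
           _use_new_zipfile_serialization=False)  # legacy format readable by older torch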
If you want to build and train a model from scratch, there are plenty of tutorials out there for that, as well as free compute resources such as Google Colab, which allows free (limited) use of GPUs for training and experimentation.
Note: This tutorial focuses on converting a PyTorch model to xmodel, but the docker tools we install here also have functionality to convert TensorFlow and Caffe models to xmodel as well. See the Xilinx Docs. Once the model has been quantized and compiled to an xmodel for the KV260, the later sections in this tutorial can be used as is, regardless of what format our float model was originally built in.
I will be using a custom model that I trained on a dataset of me in front of my TV performing the different hand gestures I wanted to use.
The model is capable of predicting six different gestures: up, down, left, right, palm, and fist. We will use these gestures to navigate the menu on my TV.
The model takes an input vector of shape [1, 63], which corresponds to a flattened array of the x, y, z coordinates of 21 hand key points that we extract in a preprocessing step using MediaPipe (more on that later). The model outputs a [1, 6] vector; taking the softmax gives the probability of each class, and the maximum entry is our predicted gesture.
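As a rough sketch of that preprocessing step (the actual implementation lives in app_kv260.py in the repo; hand.jpg is just a placeholder input, since the app reads frames from the webcam):

import cv2
import numpy as np
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)

frame = cv2.imread('hand.jpg')  # placeholder; in the app this comes from the webcam
results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    lm = results.multi_hand_landmarks[0].landmark  # 21 landmarks, each with x, y, z
    x = np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32).reshape(1, 63)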
The pretrained model (both the .pt and the KV260-compiled versions) is available on the GitHub page for you to try out. Keep in mind that your mileage may vary with its effectiveness, since it's specifically trained for my hardware setup and environment.
1. Install Xilinx Vitis-AI tools
For this step, we will download a docker image from Xilinx, and run several commands in it. You need a computer with Linux, or capable of running a Linux VM. I will be using Windows 10 with WSL2 Ubuntu 20.04 in this example.
1. Install Docker on your system.
- On Windows, make sure to accept the permissions for Docker to run with WSL2.
2. Open the Linux terminal and clone the https://github.com/Xilinx/Vitis-AI repo.
3. Pull the proper version of the Vitis-AI docker image. For this project we want Vitis-AI 1.4. Note that there are GPU versions and CPU versions of the docker images. If you don't have a CUDA GPU in your system, you must use the CPU version.
# pick the correct version
docker pull xilinx/vitis-ai-cpu:1.4.1.978 # CPU version
docker pull xilinx/vitis-ai:1.4.1.978 # GPU version
4. Navigate to the repo we cloned earlier, and call the docker_run.sh script with the tag of the docker image that you want to run. Ex:
peter@PeterDesktop:/mnt/d/Vitis-AI$ ./docker_run.sh xilinx/vitis-ai-cpu:1.4.1.978
- Note: If you get an error from the bash script, you might need to change the line endings in docker_run.sh to be Unix compatible (open it in Notepad++, switch the line-ending option to LF, and save).
For additional details see the Xilinx Docs for Installation.
Accept the prompts and you should be greeted with a nice little piece of Vitis-AI ASCII art.
==========================================
__ ___ _ _ _____
\ \ / (_) | (_) /\ |_ _|
\ \ / / _| |_ _ ___ ______ / \ | |
\ \/ / | | __| / __|______/ /\ \ | |
\ / | | |_| \__ \ / ____ \ _| |_
\/ |_|\__|_|___/ /_/ \_\_____|
==========================================
Docker Image Version: 1.4.1.978
Vitis AI Git Hash: 9f3d6db
Build Date: 2021-10-08
2. Quantize the model
The KV260 is designed to internally use integers to do all the neural network computations. This is different from the way GPUs work, which use floating point values. To convert our model and weights to work on the FPGA, we need to perform a step called quantization.
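Conceptually, quantization maps floating point weights and activations onto a small integer range with a scale factor. The Vitis-AI quantizer's actual scheme is more involved, but the basic idea looks roughly like this:

import numpy as np

w = np.array([0.12, -0.53, 0.91], dtype=np.float32)  # some float weights
scale = np.abs(w).max() / 127.0                       # map the largest magnitude to 127
w_int8 = np.round(w / scale).astype(np.int8)          # integer weights the DPU can work with
w_approx = w_int8.astype(np.float32) * scale          # the values those integers represent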
The tools we will use to quantize our model are organized inside a conda virtual environment. To activate it, we run:
conda activate vitis-ai-pytorch
Note that this conda env runs PyTorch==1.4.0. Some operations in a model built with a more recent version of PyTorch might not be supported.
The quantization will be done by a Python script that uses the Xilinx pytorch_nndct.apis Python library. The script I used is provided as model_data/quantize.py in the linked GitHub repo. It should be fairly easy to adapt model_data/quantize.py to work with your own model and dataset.
To run this script, we will need three things:
- The weights of the pretrained network (a .pt file)
- The float / PyTorch definition of your model (a class that inherits from torch.nn.Module)
- A small test dataset (200-1000 images) to check the accuracy of the quantized model, since quantization can degrade performance significantly for some models. This is actually optional: if we want to skip this step, we can just forward the quantized model once with random input before exporting it. (A minimal sketch of such a script follows this list.)
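To give an idea of what such a script looks like, here is a minimal sketch of the pytorch_nndct flow. GestureNet, my_model, and the file names are assumptions for illustration only; model_data/quantize.py in the repo is the actual script:

import torch
from pytorch_nndct.apis import torch_quantizer

from my_model import GestureNet  # your float model definition (assumed name)

device = torch.device('cpu')
model = GestureNet()
model.load_state_dict(torch.load('gesture_net.pt', map_location=device))
model.eval()

dummy_input = torch.randn(1, 63)

# 'calib' pass: run data through the model to collect quantization statistics
quantizer = torch_quantizer('calib', model, (dummy_input,), device=device)
quant_model = quantizer.quant_model
quant_model(dummy_input)  # ideally loop over your small test dataset here
quantizer.export_quant_config()

# 'test' pass: optionally check accuracy, then export the quantized xmodel
quantizer = torch_quantizer('test', model, (dummy_input,), device=device)
quant_model = quantizer.quant_model
quant_model(dummy_input)
quantizer.export_xmodel(deploy_check=False)  # writes quantize_result/<model_name>_int.xmodel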
To make these files accessible, we need to place them (the quantize.py file, the model weights, the model definition, and the test set) in a directory that is visible to the docker. In this case, we will use the Vitis-AI/data directory inside the repo we cloned earlier, which is made visible to the docker by the bash script we used to start it.
Then from the terminal running the docker, we can call the python script:
(vitis-ai-pytorch) Vitis-AI /workspace/data > python quantize.py
After running this script, we should see a new folder in the data directory called quantize_result, containing a <model_name>_int.xmodel file. If that's the case, we are ready to move on to the next step. If something went wrong, please see the Troubleshooting section.
Additional reading: Xilinx PyTorch Quantization Docs
3. Compile the model
Now that our model has been quantized to a Xilinx Intermediate Representation (XIR) .xmodel file, we need to compile it for the specific hardware we are going to use: the KV260.
From inside the docker conda env, we are going to run the vai_c_xir compiler tool, which expects the following arguments (a full example invocation follows the list below):
vai_c_xir -x /PATH/TO/quantized.xmodel -a /PATH/TO/arch.json -o /OUTPUTPATH -n NETNAME
- -x: Path to the quantized model, which should be data/quantize_result/<model_name>_int.xmodel if you followed the previous instructions.
- -a: Target architecture JSON. For the KV260 this will be /opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json
- -o: Directory where you want to output the model. Suggestion: .
- -n: Filename you want the compiled model to have. Ex: mymodel_kv260
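Putting those together, a full invocation might look like this (the GestureNet_int.xmodel name is just an example; use whatever file the quantizer actually produced):

vai_c_xir -x data/quantize_result/GestureNet_int.xmodel -a /opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json -o . -n mymodel_kv260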
Warning: Pay attention to the output line that says DPU subgraph number X. This must be 1; otherwise, we will have problems when we try to run the compiled model on the KV260. If it is something other than 1, see the Troubleshooting section.
If it runs successfully, you should see a file called mymodel_kv260.xmodel. This is the compiled model. Copy it over to the KV260, to a place where it will be accessible. We will load and run it using a Python script in the following sections.
Troubleshooting
If you are trying to quantize, compile, and run your own model, this quantize/compile stage is probably where you will encounter the most problems. Check out the troubleshooting guide in the GitHub repo for some help.
Set up the KV260 with Ubuntu, PYNQ, and Mediapipe
For this step, we are going to start by following the instructions provided in the PYNQ Kria GitHub repo. In brief, the steps are:
1. Flash the microSD card with the Ubuntu image.
2. SSH into the KV260.
3. Clone the PYNQ Kria repo.
4. Run the provided install script.
The PYNQ installation will create a virtual environment called pynq-venv with all the Vitis-AI 1.4 tools (VART, XIR, DPU Overlay) for running compiled xmodels through a convenient Python API.
Finally, we have to do one additional thing: install Mediapipe to the pynq-venv.
Mediapipe does not have an aarch64 pip package, so we either have to compile the wheel ourselves on the KV260 (Instructions), or use the wheel I have provided in this project's GitHub repo.
To install the wheel file, we first have to activate the pynq-venv as root, then just run pip install:
sudo -i
source /etc/profile.d/pynq_venv.sh
pip install path/to/mediapipe/wheel
Run our compiled model with PYNQ DPU Overlay
To run our compiled model and process its output to control my FireTV stick, I have written a script called app_kv260.py.
We will run this script in the pynq-venv we prepared in the previous section. To activate the venv and run the Python script, we will use the following commands:
sudo -i
source /etc/profile.d/pynq_venv.sh
xauth merge /home/ubuntu/.Xauthority # optional enable X11 forwarding (ignore warning)
python <path_to_script>
I have also included a line that enables X11 forwarding, which will allow us to see the display from OpenCV on the computer we are using to SSH into the Kria.
These are the critical lines for setting up the DPU with our model and running inference.
from pynq_dpu import DpuOverlay
import numpy as np

# Set up DPU
overlay = DpuOverlay("dpu.bit")
# Path to your compiled x model
path = '/home/ubuntu/my_model.xmodel'
# gives an assertion error if DPU subgraph number > 1
overlay.load_model(path)
dpu = overlay.runner
# Set up space in memory for input and output of DPU
inputTensors = dpu.get_input_tensors()
outputTensors = dpu.get_output_tensors()
shapeIn = tuple(inputTensors[0].dims)
shapeOut = tuple(outputTensors[0].dims)
input_data = [np.empty(shapeIn, dtype=np.float32, order="C")]
output_data = [np.empty(shapeOut, dtype=np.float32, order="C")]
# Load in input data (its shape must match shapeIn), and run inference
x = get_input_data() # generic function to get input data
input_data[0] = x
job_id = dpu.execute_async(input_data, output_data)
dpu.wait(job_id)
y = output_data[0]
process_output(y) # generic output function to process output
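For reference, a minimal version of process_output for this six-gesture model could look like the following. The label order in GESTURES is an assumption for illustration; the real mapping lives in the repo:

GESTURES = ['up', 'down', 'left', 'right', 'palm', 'fist']  # assumed order of the 6 outputs

def process_output(y):
    logits = np.asarray(y).reshape(-1)       # [1, 6] -> [6]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over the 6 classes
    idx = int(np.argmax(probs))
    return GESTURES[idx], float(probs[idx])  # predicted gesture and its probability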
Here's an example of the visualizer output, showing that it is able to correctly classify some hand gestures.
The app runs at about 3 FPS, which is sufficient for interacting with the menu UI in real time. Most of the run time comes from OpenCV grabbing frames and MediaPipe preprocessing the data for the model running on the KV260 DPU; the inference for the compiled model itself takes only a fraction of a second.
Process the output and control our smart home devices
Now that we have some output from our model, we need to process it and send commands to the smart home device we want to control. In this tutorial, I will be controlling my FireTV stick over my local Wi-Fi network.
The FireTV stick is based on Android and has a built-in debug mode (ADB) that lets us send commands to a shell terminal to interact with the device. And it all comes with Python bindings! We just have to enable ADB in the FireTV settings and note down the IP address to get started. The full setup for that can be found in this article.
Here is the main code for our controller class. You can find the full code for this in the GitHub repo.
import os

from adb_shell.adb_device import AdbDeviceTcp
from adb_shell.auth.keygen import keygen
from adb_shell.auth.sign_pythonrsa import PythonRSASigner

# KEYCODES maps command names to ADB keyevent codes (as bytes); defined in the full script

class FireTVController():
    def __init__(self):
        # Generate (or reuse) the RSA keys used to authenticate with the FireTV
        if not os.path.isfile('adbkey'):
            print("Generating ADB Keys")
            keygen('adbkey')
        else:
            print('ADB keys found')
        with open('adbkey') as f:
            priv = f.read()
        with open('adbkey.pub') as f:
            pub = f.read()
        self.creds = PythonRSASigner(pub, priv)

    def add_device(self, deviceIP):
        self.device = AdbDeviceTcp(deviceIP, 5555, default_transport_timeout_s=9.)
        try:
            self.device.close()
        except:
            print("No device connected")
        else:
            self.device.connect(rsa_keys=[self.creds], auth_timeout_s=10)
            print("Device Connected")
        return self.device

    def send_command(self, cmd: str):
        assert cmd in KEYCODES.keys()  # ensure that a valid command is passed
        self.device._service(b'shell', b'input keyevent ' + KEYCODES[cmd])
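Usage then looks something like this (the IP address is a placeholder, and 'up' is assumed to be one of the keys in KEYCODES):

controller = FireTVController()
controller.add_device('192.168.1.123')  # replace with your FireTV stick's IP
controller.send_command('up')           # navigate up in the menu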
If there are other devices you want to control, you might consider having the KV260 send triggers to IFTTT through the webhooks interface. With that setup, you should be able to communicate with just about any smart home device out there.
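As a sketch of that idea, a Webhooks trigger is just an HTTP request; the event name and key below are placeholders you would configure in your own IFTTT applet:

import requests

IFTTT_KEY = 'YOUR_WEBHOOKS_KEY'  # from the IFTTT Webhooks service settings

def trigger_ifttt(event, value=None):
    # Fire a Webhooks trigger; the matching IFTTT applet then controls the device
    url = f'https://maker.ifttt.com/trigger/{event}/with/key/{IFTTT_KEY}'
    requests.post(url, json={'value1': value})

trigger_ifttt('gesture_palm')  # e.g. an applet that toggles the lights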
Possible Extensions and Future Work
- Ability to recognize a larger number of hand gestures
- Process body pose as well to more accurately predict gestures or enable new types of gestures
- Use video from home security cameras sending data over RTSP
- Use a model to directly process input frames, performing both the hand key point extraction step and the recognition step. This was actually my original plan, but I was not able to find a pretrained model that was quantizable with the Vitis-AI software (e.g. I couldn't get OpenPose and Vitis-AI to play nice).
- Process video, rather than individual frames. This is much more challenging, but it would allow more natural gestures, like swiping up and down, a "click" tapping action, or turning an imaginary volume knob, for controlling different functions.
The KV260 is an awesome piece of hardware for performing inference on computer vision tasks. It's a little tricky to get started with, but hopefully this project gives you some guidance! I look forward to upcoming improvements in the Vitis-AI workflow from Xilinx, and hopefully even more hobbyist-focused hardware in the future.
Troubleshooting
Please check out the Troubleshooting section in the GitHub repo. (This page was getting a tad long.)