The aim of this project is to compile and deploy a large convolutional neural network that was not designed upfront to run on hardware such as FPGAs. Unet was originally invented for medical applications and is strong in the field of pixel-wise semantic segmentation. In this project we train Unet for semantic segmentation of regular street scenes.
To underline our top-to-bottom approach, from AI research to hardware, we build our project upon a working implementation of Unet from dhkim0225. The repository contains a set of Python scripts to generate data for training and evaluation, to run training, and to load pre-trained (VGG16) weights for the down-sampling part of the network.
System requirements
The project is tested on both Ubuntu 18.04 and Arch Linux, but most of the tools run encapsulated inside the Vitis AI docker, so the host operating system is more of a soft requirement. The system must be able to run docker containers. TensorFlow/Keras is available for many platforms.
Training and quantization usually have high RAM usage. At least 8 GB of installed RAM is recommended. RAM usage can be reduced by decreasing the batch size.
Training the network
We choose a network configuration with a 256x512 input layer (RGB) targeting classification of four classes: Person, Car, Road and Background.
Cityscapes is a large-scale dataset that contains annotated images of regular street scenes from a car's perspective.
To be able to later use the trained model for quantization, we must replace the Conv2DTranspose layers, which are not yet fully supported in Vitis AI 1.2, each by an UpSampling2D and a Conv2D layer. The change has to be applied to each of Unet's four up-sampling blocks:
...
# UP 3
x = UpSampling2D(size=(2, 2),data_format="channels_last")(x)
# x = Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same')(x)
x = Conv2D(128, kernel_size=2, data_format="channels_last",
activation="relu", padding="same")(x)
# x = BatchNormalization()(x)
# x = Activation('relu')(x)
x = Concatenate(axis=3)([x, block_2_out])
...
# UP 4
...
The modified unet.py is available here.
The main steps for training are simple:
1. Generate optimized and preprocessed data for training (only once):
$ python3 dataset_parser/make_h5.py \
--path "/downloaded/leftImg8bit/path/" \
--gtpath "/downloaded/gtFine/path/"
2. Train the model
# using pre-trained vgg weights is optional
$ python3 train.py --model unet --vgg vgg-weights.h5
3. Test the model
$ python3 test.py --model unet
The default training script runs over 100 epochs and iterates over a large number of training steps per epoch. Fortunately it uses callbacks and outputs a test image after each epoch into a ./temp folder. Tests showed that accuracy (Sørensen–Dice coefficient) is already high after a few epochs, especially when initializing with the provided VGG16 weights.
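The accuracy metric deserves a closer look. Below is a minimal sketch of the Sørensen–Dice coefficient as a Keras metric, assuming one-hot encoded ground truth and softmax outputs; the exact implementation in the training repository may differ slightly:
import keras.backend as K

def dice_coef(y_true, y_pred, smooth=1.0):
    # Dice = 2*|A ∩ B| / (|A| + |B|); the smoothing term avoids division by zero
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)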
Xilinx provides a Vitis AI docker for either GPU or CPU. The GPU docker requires CUDA 10.0 and a supported NVIDIA GPU.
The Vitis AI docker can be obtained from dockerhub or built from source.
Xilinx provides a script to run the docker. When running for the first time, it automatically pulls the docker image. It also passes the right parameters and mounts the current working directory into the docker container. So, we copy the script and the file docker/PROMPT.txt into our working directory and always start the container from there.
./docker_run.sh xilinx/vitis-ai:latest
The Vitis AI docker has preinstalled Conda environments for different AI frameworks. To activate them run (in case of TensorFlow):
conda activate vitis-ai-tensorflow
Add Keras to Vitis AI docker
Make sure the docker image is modified such that Conda has the correct keras==2.2.4 package installed:
./docker_run.sh xilinx/vitis-ai-cpu:latest
sudo su
conda activate vitis-ai-tensorflow
conda install keras==2.2.4 # compatible with TF 1.15
conda deactivate
exit
conda activate vitis-ai-tensorflow
In another terminal (while still running the container) stage your modifications for the next docker restart:
sudo docker commit -m "<message>" <container_id> xilinx/vitis-ai:latest
Converting trained Keras Model to TensorFlow .pb format
(Note: This step can also be done outside the Vitis AI docker when using TensorFlow version 1.15 AND keras==2.2.4)
After finishing the first epoch of training we should already find a file unet_model_weight.h5 which stores the exported weights. The training script automatically exports trained weights after each epoch and discards results when accuracy did not improve.
Since the Vitis AI Quantizer requires a frozen TensorFlow model in .pb
format, we have to convert the Keras model:
import os
import tensorflow as tf
import keras
from models.unet import unet

### load keras model
keras.backend.clear_session()
keras.backend.set_learning_phase(0)
with keras.backend.get_session() as sess:
    model = unet(input_shape=(256, 512, 3),
                 num_classes=4, lr_init=1e-3, lr_decay=5e-4)
    model.load_weights('./unet_model_weight.h5')
    ### write graph and save checkpoint files
    saver = tf.train.Saver()
    graph_def = sess.graph.as_graph_def()
    save_path = saver.save(sess, os.path.join('./out', "float_model.ckpt"))
    tf.train.write_graph(graph_def, './out', "infer_graph.pb", as_text=False)
(Note: copy the Keras Model (unet.py
) which was used for training into the current working directory)
The full python script again is provided here.
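A quick sanity check (a sketch) confirms that the checkpoint and inference graph actually landed in ./out:
import os
import tensorflow as tf

# saver.save() also writes a 'checkpoint' index file that latest_checkpoint() reads
print(tf.train.latest_checkpoint('./out'))      # expect ./out/float_model.ckpt
print(os.path.exists('./out/infer_graph.pb'))   # expect True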
Analyzing the Model
(Note: Outside the Vitis AI docker, install netron)
For the following steps it is very important to know the correct input and output node names of our converted model. We can use the tool netron to inspect the model graph and get the correct names. The input and output names should name a specific node, not just a layer, and can have the following form:
Input node name: "input_1"
Output node name: "conv2d_13/BiasAdd"
The input node is usually the first node of the network, whereas the output node should be the last node of the last (in this case Conv2D) layer. We can cut off the Softmax function because it only rescales the resulting values monotonically and does not change the pixel-wise classification outcome.
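A quick numpy check illustrates why dropping the Softmax is safe: softmax is monotonic per pixel, so the argmax over classes is identical for logits and probabilities (the values below are random stand-ins):
import numpy as np

logits = np.random.randn(256, 512, 4).astype(np.float32)  # stand-in for the conv2d_13/BiasAdd output
exp = np.exp(logits - logits.max(axis=2, keepdims=True))   # numerically stable softmax
probs = exp / exp.sum(axis=2, keepdims=True)

# pixel-wise classification is unchanged
assert (logits.argmax(axis=2) == probs.argmax(axis=2)).all()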
Use the TensorFlow freeze_graph tool, which is preinstalled inside the Vitis AI docker.
MODEL_PATH=./converted_model_path
OUTPUT_NODE="conv2d_13/BiasAdd"
freeze_graph \
--input_graph ${MODEL_PATH}/infer_graph.pb \
--input_checkpoint ${MODEL_PATH}/float_model.ckpt \
--output_graph ${MODEL_PATH}/frozen_graph.pb \
--output_node_names ${OUTPUT_NODE} \
--input_binary true
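To double-check the node names without netron, one can also list the ops in the frozen graph programmatically (a sketch, assuming the TF 1.15 environment):
import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('./converted_model_path/frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# the Placeholder is the input node; the last BiasAdd is our output node
for node in graph_def.node:
    if node.op in ('Placeholder', 'BiasAdd'):
        print(node.op, ':', node.name)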
Troubleshooting - Freezing fails: IndexError: list index out of range
Make sure the model was converted using keras==2.2.4, which is compatible with TensorFlow 1.15.
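A quick version check inside the activated Conda environment rules this out (a sketch):
import tensorflow as tf
import keras

print(tf.__version__)     # expect 1.15.x
print(keras.__version__)  # expect 2.2.4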
We must provide a python input_fn that feeds input images to the Quantizer. Copy a small subset of the downloaded Cityscapes dataset into the working directory and change calib_img_path accordingly. The input_fn traverses the folder and returns a dict mapping the input node name to a batch of images. Hence, calib_batch_size must also be defined for input_fn, as must the input_node_name of our network.
import os
import cv2
import numpy as np

calib_img_path = "./dataset/"
calib_batch_size = 10
input_node_name = "input_1"

def input_fn(iter):
    # collect all image paths in the calibration folder
    paths = []
    for (path, dirname, files) in sorted(os.walk(calib_img_path)):
        for filename in sorted(files):
            if filename.endswith(('.jpg', '.png')):
                paths.append(os.path.join(path, filename))
    # load and preprocess one batch, wrapping around at the end of the list
    images = []
    for index in range(0, calib_batch_size):
        img = cv2.imread(paths[(iter * calib_batch_size + index) % len(paths)], 1)
        try:
            img = cv2.resize(img, (512, 256))
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = img.astype(np.float32) / 127.5 - 1
            images.append(img)
        except cv2.error as e:
            print('Invalid frame!')
            print(e)
    return {input_node_name: images}

def main():
    input_fn(0)

if __name__ == "__main__":
    main()
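Since the quantizer imports this function as graph_input_fn.input_fn (see the --input_fn flag below), the script should be saved as graph_input_fn.py in the working directory. A quick local check of the returned batch shape (a sketch):
import numpy as np
from graph_input_fn import input_fn, input_node_name

batch = input_fn(0)
images = np.array(batch[input_node_name])
print(images.shape)                # expect (10, 256, 512, 3)
print(images.min(), images.max())  # preprocessing maps pixels roughly to [-1, 1]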
Now use Vitis AI Quantizer (vai_q_tensorflow for TensorFlow) to quantize the frozen Model:
MODEL_PATH=./converted_model_path
FROZEN_GRAPH=${MODEL_PATH}/frozen_graph.pb
INPUT_NODES="input_1"
OUTPUT_NODES="conv2d_13/BiasAdd"
OUTPUT_DIR="./quantize_result"
ITERATIONS=20
vai_q_tensorflow quantize \
--input_frozen_graph ${FROZEN_GRAPH} \
--input_nodes ${INPUT_NODES} \
--input_shapes ?,256,512,3 \
--output_nodes ${OUTPUT_NODES} \
--input_fn graph_input_fn.input_fn \
--method 1 \
--weight_bit 8 \
--activation_bit 8 \
--calib_iter ${ITERATIONS} \
--simulate_dpu 1 \
--output_dir ${OUTPUT_DIR}
(Note: to use GPU acceleration for quantization, add the parameter --gpu $id)
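Besides deploy_model.pb, the quantizer also writes a quantize_eval_model.pb into the output directory, which can be run on the CPU to spot-check the quantized network. A minimal sketch, assuming TF 1.15 and the node names determined earlier (adjust the tensor names if the quantizer reports different ones):
import numpy as np
import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('./quantize_result/quantize_eval_model.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    inp = graph.get_tensor_by_name('input_1:0')
    out = graph.get_tensor_by_name('conv2d_13/BiasAdd:0')
    with tf.Session(graph=graph) as sess:
        frame = np.random.rand(1, 256, 512, 3).astype(np.float32)  # stand-in for a preprocessed image
        scores = sess.run(out, feed_dict={inp: frame})
        print(scores.shape)  # expect (1, 256, 512, 4)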
For compilation and later inference of our DPU PL accelerator we need the following hardware files which originally come out of the Vivado hardware design process:
dpu.bit
dpu.hwh
dpu.xclbin
A pre-built set of files for two different DPU architectures can be obtained from here.
(Note: For use with Vitis AI v1.2 and Zynq DPU v3.2 the files must be built with Vivado 2020.1)
The Vitis AI Compiler takes DPU configurations as a dpu.dcf file, which can be parsed from the dpu.hwh file using the tool dlet:
$ dlet -f dpu.hwh # run this inside Vitis AI docker
If you want to reconfigure the DPU and therefore re-build the PL, refer to the section "Appendix -- Generate Hardware Files".
Model Compilation
The Vitis AI Compiler compiles our quantized network targeting a specific DPU and PL configuration described by Ultra96.json, which points to a dpu.dcf file.
{
"target" : "DPUCZDX8G",
"dcf" : "./hw_files/B2304/dpu.dcf",
"cpu_arch" : "arm64"
}
When the hardware files are in place, we can run compilation using vai_c_tensorflow:
HW_FILES='./hw_files/B2304'
OUTPUT_DIR='./compile_result'
KERNEL_NAME='Unet'
# set 'mode' to 'debug' for debugging; 'dump':'fused_graph_info' outputs helpful kernel info
vai_c_tensorflow \
--frozen_pb ./quantize_result/deploy_model.pb \
--arch ${HW_FILES}/Ultra96.json \
--output_dir ${OUTPUT_DIR} \
--net_name ${KERNEL_NAME} \
--options "{'mode':'normal', 'dump':'fused_graph_info', 'save_kernel':'$OUTPUT_DIR'}"
Now the OUTPUT_DIR should contain a compiled DPU kernel Unet.elf, which can later be passed to the Vitis AI Python/C++ APIs to run DPU inference. Check the compilation output or the generated info files such as Unet_kernel.info for the correct input and output tensor names:
Kernel topology "Unet_kernel_graph.jpg" for network "Unet"
kernel list info for network "Unet"
Kernel ID : Name
0 : Unet
Kernel Name : Unet
------------------------------------------------------------------------------
Kernel Type : DPUKernel
Code Size : 1.86MB
Param Size : 24.66MB
Workload MACs : 226811.19MOPS
IO Memory Space : 40.44MB
Mean Value : 0, 0, 0,
Total Tensor Count : 39
Boundary Input Tensor(s) (H*W*C)
input_1:0(0) : 256*512*3
Boundary Output Tensor(s) (H*W*C)
conv2d_13_convolution:0(0) : 256*512*4
Total Node Count : 34
Input Node(s) (H*W*C)
block1_conv1_convolution(0) : 256*512*3
Output Node(s) (H*W*C)
conv2d_13_convolution(0) : 256*512*4
Setup Ultra96 v2 with PYNQ v2.6
With Model Compilation we completed all steps on the host machine and setup can continue on the Ultra96 v2.
Avnet provides pre-built images of the latest PYNQ v2.6 for Ultra96 v2 on github. Just follow the instructions provided in the releases section of Avnet/Ultra96-PYNQ.
The great advantage of upgrading Ultra96 v2 to the latest PYNQ v2.6 is that we get all the comfort of easy-to-access Jupyter notebooks and even upgradeability through the package manager APT.
Connect to Ultra96 v2 via ssh over a USB cable. Refer to the getting started guide to learn how to connect and activate Wifi through the provided Jupyter notebook wifi.ipynb.
Finally, the Vitis AI DPU Python/C++ APIs must be installed. We can skip upgrading the image as described in DPU-PYNQ's Quickstart Guide, because Ultra96 v2 is already running PYNQ v2.6. Just install pynq-dpu and make the installed example notebooks available to Jupyter:
## run on Ultra96 v2
$ pip3 install pynq-dpu
$ cd $PYNQ_JUPYTER_NOTEBOOKS # usually ~/jupyter_notebooks/
$ pynq get-notebooks pynq-dpu -p .
(Note: make sure to have Ultra96 v2 connected to Wifi with internet)
Deploy Model
When Ultra96 v2 setup is completed, connect to the board via ssh and create a new working directory. The directory must reside in a sub-folder of (normally) 'jupyter_notebooks' in order to be recognized by Jupyter.
Then copy the following files into the new working directory (using scp):
dpu.bit (bitstream)
dpu.hwh
dpu.xclbin
Unet.elf
(Note: it is important that dpu.hwh is in the same directory and has the same base name as dpu.bit, because it is loaded implicitly when loading the bitstream dpu.bit)
For inference on our AI accelerator we create a regular Jupyter notebook in the working directory and add the following lines of code:
(Note: This section gives a detailed explanation of the different parts of the Jupyter notebook; jump to the end to find a link to the complete notebook)
Load bitstream
from pynq_dpu import DpuOverlay
from dnndk import n2cube
overlay = DpuOverlay("./dpu.bit") # loads bitstream from current working directory
(Note: don't confuse this with DpuOverlay("dpu.bit"), which loads a default DPU bitstream that comes with the pynq_dpu package)
Check if DPU is configured & loaded correctly:
!dexplorer -w
Configure and load DPU Kernel and create a task
overlay.load_model("./dpu_Unet.elf")
KERNEL_NAME = "Unet"
KERNEL_CONV_INPUT = "block1_conv1_convolution"
KERNEL_OUTPUT = "conv2d_13_convolution"
(Note: check the Vitis AI compilation output and insert the correct input and output names accordingly. Otherwise the program will crash)
n2cube.dpuOpen()
kernel = n2cube.dpuLoadKernel(KERNEL_NAME)
task = n2cube.dpuCreateTask(kernel, 0)
Get DPU Kernel parameters
input_len = n2cube.dpuGetInputTensorSize(task, KERNEL_CONV_INPUT)
output_len = n2cube.dpuGetOutputTensorSize(task, KERNEL_OUTPUT)
conf = n2cube.dpuGetOutputTensorAddress(task, KERNEL_OUTPUT)
output_scale = n2cube.dpuGetOutputTensorScale(task, KERNEL_OUTPUT)
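The returned sizes follow directly from the 256x512 input and the four classes, so a quick check catches wrong node names early (a sketch):
print(input_len)     # expect 256 * 512 * 3 = 393216
print(output_len)    # expect 256 * 512 * 4 = 524288
print(output_scale)  # fixed-point scale the runtime applies to the DPU output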
Import required packages
import numpy as np
import os
from time import time
import matplotlib.pyplot as plt
import cv2
# maps output classes to an RGB image (class 0, background, stays black)
def result_map_to_img(res_map, img):
    res = np.squeeze(res_map)
    argmax_idx = np.argmax(res, axis=2)
    # boolean masks for np.where
    person = (argmax_idx == 1)
    car = (argmax_idx == 2)
    road = (argmax_idx == 3)
    img[:, :, 2] = np.where(person, 255, 0)
    img[:, :, 1] = np.where(car, 255, 0)
    img[:, :, 0] = np.where(road, 255, 0)
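A short usage sketch with a dummy score map (the real DPU output is plugged in further below):
dummy_scores = np.random.rand(1, 256, 512, 4)  # stand-in for a (1, H, W, num_classes) score map
vis = np.zeros((256, 512, 3), dtype=np.uint8)
result_map_to_img(dummy_scores, vis)
plt.imshow(vis)  # each class is rendered into its own color channel
plt.show()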
Load batch of images
This is a function similar to the input_fn used for quantization. So, again copy a subset of Cityscapes test images and change calib_img_path and calib_batch_size accordingly:
calib_img_path = "./leftImg8bit/train/"
calib_batch_size = 5
start_iter = 135

## images for evaluation
paths = []
for (path, dirname, files) in sorted(os.walk(calib_img_path)):
    for filename in sorted(files):
        if filename.endswith(('.jpg', '.png')):
            paths.append(os.path.join(path, filename))

images = np.zeros((calib_batch_size, 256, 512, 3))
images_scaled = []
for index in range(0, calib_batch_size):
    img_raw = cv2.imread(
        paths[(start_iter * calib_batch_size + index) % len(paths)], 1)
    try:
        img_scaled = cv2.resize(img_raw, (512, 256))
        images_scaled.append(img_scaled)
        img = cv2.cvtColor(img_scaled, cv2.COLOR_BGR2RGB)
        img = img.astype(np.float32) / 127.5 - 1
        images[index] = img.astype(np.float32)
        plt.imshow(img_scaled)
    except cv2.error as e:
        print('Invalid frame!')
        print(e)
Run classification batch
This script runs classification on a batch of calib_batch_size images. The function result_map_to_img(classes, result[i]), defined above, simply runs argmax on the classification output and maps the result into an RGB image.
start = time()
result = np.zeros((calib_batch_size, 256, 512, 3), dtype=np.uint8)
classes = np.zeros((256, 512, 4), dtype=np.int8)
for i in range(calib_batch_size):
    image_data = np.expand_dims(images[i].astype(np.float32), 0)
    n2cube.dpuSetInputTensorInHWCFP32(task, KERNEL_CONV_INPUT, image_data, input_len)
    n2cube.dpuRunTask(task)
    image_out = n2cube.dpuGetOutputTensorInHWCFP32(task, KERNEL_OUTPUT, output_len)
    classes = np.array(image_out, dtype=np.int8)
    classes = np.reshape(classes, (1, 256, 512, 4))
    result_map_to_img(classes, result[i])
stop = time()
throughput = calib_batch_size / (stop - start)
print("Throughput in FPS: ", throughput)
Write classification results to file
For better visualization we combine input and output images:
results_combined = np.zeros((calib_batch_size, 256, 2*512, 3), dtype=np.uint8)
for i in range(calib_batch_size):
    results_combined[i] = np.concatenate((images_scaled[i], result[i]), axis=1)

batch_results_combined = results_combined[0]
for i in range(1, calib_batch_size, 1):
    batch_results_combined = \
        np.concatenate((batch_results_combined, results_combined[i]), axis=0)

## write combined batch results to file
cv2.imwrite("./batch_results.png", batch_results_combined)
The complete Jupyter notebook is available here. It can be copied into the working directory on Ultra96 v2, which has to be the same folder as the hardware files.
Accuracy
As the results show, Unet does indeed perform pixel-wise semantic segmentation with very high accuracy. Already after a few (approx. 20) epochs, the trained model predicts with an accuracy (Sørensen–Dice coefficient) higher than 0.93. Quantization had surprisingly low (at least no visible) impact on the segmentation accuracy.
(Note: Another evaluation script, which can be used to analyze the frozen graph or the quantized graph, can be found here. Just change the path names to the input data accordingly and pass the correct model graph to be evaluated.)
Analyze Performance
One method is to measure inference time right inside the Jupyter notebook when running classification. It is recommended to measure the inference time of a larger batch of classifications and average, to get better accuracy. Nevertheless, this method does not tell us how much time was spent purely on DPU execution and what part goes to the software.
Xilinx provides a tool called dsight to trace DPU core utilization. To use it, we must set mode to 'debug' instead of 'normal' for compilation and put the DPU into profiling mode before creating a DPU task:
!dexplorer -m profile # run this to activate debug mode and dump traces
The DPU trace ('.prof') is stored in the working directory and can be visualized with dsight:
$ dsight -p dpu_trace_1876.prof # generates dpu_trace_1876.html
Use Jupyter to navigate to the generated '.html' trace file. The file can be displayed in a web browser:
We see that inference on the DPU takes most of the overall inference time, namely 760 ms, which leads to a theoretical frame rate of 1.31 FPS. The measured frame rate in the Jupyter notebook is approximately 1.28 FPS.
Performance Optimization
Analysis shows that the Vitis AI Python API is already highly optimized, and the margin for software optimization, e.g. by moving to a pure C++ implementation, is fairly small compared to the overall inference time.
Structural Optimization
Convolutional layers (Keras Conv2D layers) store most of the information inside the network and are responsible for the high accuracy, but unfortunately also for the high inference time. We can re-arrange the computation of the convolutional layers by introducing separable convolutions. This article gives a more detailed explanation of how this re-arrangement works.
We replace all convolutional (Conv2D) layers except the first two, as sketched below. These first layers store essential low-level feature detectors and must stay unchanged so that the network can still be initialized with the pre-trained VGG16 weights as described in the first section. The modified network is available here.
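The following sketch illustrates the structural change on a single layer (illustrative only; the real modification is applied throughout unet.py): a standard convolution is swapped for a depthwise-separable one, which splits spatial filtering and channel mixing and thereby cuts the parameter count and MACs considerably.
from keras.layers import Input, SeparableConv2D
from keras.models import Model

inp = Input(shape=(256, 512, 3))
# original: x = Conv2D(64, (3, 3), activation='relu', padding='same')(inp)
x = SeparableConv2D(64, (3, 3), activation='relu', padding='same')(inp)
model = Model(inp, x)
model.summary()  # parameter count drops markedly vs. the plain Conv2D variant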
The modified network must be retrained. This comes with a drop in accuracy, and training might take significantly longer. The steps for deployment remain the same; just make sure to change the naming of the input/output nodes accordingly.
This modification increases the frame rate by almost 3x to approximately 3.3 FPS. When analyzing the DPU core utilization again, we see that the two remaining Conv2D layers are still the bottleneck in the graph. But replacing them too causes an intolerable drop in accuracy.
Pruning
Pruning is most likely the key to further optimizing the performance of our network. Xilinx provides a very powerful pruning tool within the Vitis AI Optimizer, which is available to commercial customers. Pruning is a very computationally intensive process and requires high-end GPUs. I prepared the necessary scripts to run pruning on this model, but was not able to prune the whole model due to a lack of resources.
Conclusion
The whole project is also available on github.
This project demonstrates how to deploy a regular but rather large convolutional neural network to an edge device such as Ultra96 v2 using Vitis AI. While performance still must be optimized, we preserve the high accuracy of the network even when running classification on the FPGA.
Appendix -- Generate Hardware Files
(Note: This requires an installation of Vivado 2020.1 outside of the Vitis AI docker)
Vitis AI 1.2 requires all hardware to be generated with Vivado 2020.1. In order to use a custom PL accelerator with PYNQ v2.6, one could manually create a block diagram in Vivado, run synthesis and implementation, and export the hardware files. For pure DPU inference, Xilinx provides scripts (DPU-PYNQ) which automatically run Vivado to generate the hardware files based upon a configuration file.
Note: DPU-PYNQ is not updated for Vivado 2020.1 yet, but most likely will be soon. Nevertheless, the DPU-PYNQ scripts pull the Vivado project from another repository, PYNQ-derivative-overlays, which already contains the updated version. This means we only have to update two lines in get_platform.sh and check_env.sh in the DPU-PYNQ repo. A fork of DPU-PYNQ already contains these modifications and is available here (branch 'update2020.1').
Configure the DPU
The DPU can be reconfigured by changing the dpu_conf.vh file. The template configuration file is well-commented. Refer to the Zynq DPU Product Guide for more details on the configuration options. It should be noted that the DPU configuration is always constrained by the available amount of DSPs, BlockRAMs and LUTs. For example, Ultra96 v2 supports DPU architectures up to B2304.
Build DPU
Follow these steps, namely run:
$ make BOARD=Ultra96
to build the following hardware files:
dpu.bit
dpu.hwh
dpu.xclbin