Vision-based machine learning inference is a hot topic, with implementations being used at the edge for a range of applications, from vehicle detection to pose tracking and the classification and identification of people, objects, and animals.
Due to the complexity of convolutional neural networks, implementing machine learning inference can be computationally intensive, which makes achieving high frame rates challenging on traditional computational architectures. Heterogeneous System on Chips like the Zynq and Zynq MPSoC, which combine high-performance ARM processors with programmable logic, offer a solution which can significantly accelerate performance.
The challenge has previously been creating a programmable logic implementation that is easy to use and works with common machine learning frameworks, e.g. Caffe and TensorFlow.
The Deep Neural Network Development Kit
To address the need to work with common industry frameworks and to enable acceleration in programmable logic without implementing the entire network from scratch, Deephi (owned by Xilinx) developed the Deep Neural Network Development Kit (DNNDK).
The DNNDK is based on C/C++ APIs and allows us to work with common industry standard frameworks, and with popular networks including VGG, ResNet, GoogLeNet, YOLO, SSD, and MobileNet.
At the heart of the DNNDK, enabling the acceleration of the deep learning algorithms, is the deep learning processor unit (DPU). On our Zynq or Zynq MPSoC system, the DPU resides in the programmable logic. To support different deep learning acceleration requirements, there are several different DPU variants which can be implemented.
The basic stages of deploying an AI/ML application on a Zynq / Zynq MPSoC using the DNNDK are:
- Compress the neural network model — Takes the network model (prototxt) and trained weights (Caffe) and produces a quantized model which uses INT8 representation. To achieve this, a small input training set is also typically required — this contains 100 to 1000 images.
- Compile the neural network model — This generates the ELF files necessary for the DPU instantiations. It will also identify elements of the network which are not supported by the DPU, so they can be implemented on the CPU.
- Create the program using DNNDK APIs — With the DPU kernels created, we can now build the application which manages the inputs and outputs, performs DPU kernel life-cycle management, and handles DPU task management (a minimal sketch of this life cycle is shown after this list). During this stage, we also need to implement on the CPU any network elements not supported by the DPU.
- Compile the hybrid DPU application — Once the application is ready, we can run the hybrid compiler, which will compile the CPU code and link it to the ELF files for the DPU kernels within the programmable logic.
- Run the hybrid DPU executable on our target.
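To give a feel for what the application code in the third step looks like, below is a minimal sketch of the DPU kernel and task life cycle using the N2Cube C/C++ API. The kernel name example_net is a placeholder, and the exact calls and header paths depend on your network and DNNDK version, so treat this as an outline rather than a drop-in implementation.

// Minimal sketch of the DPU kernel / task life cycle using the N2Cube API.
// "example_net" is a placeholder for the kernel name generated by DNNC.
#include <dnndk/dnndk.h>

int main() {
    dpuOpen();                                        // attach to the DPU driver

    DPUKernel *kernel = dpuLoadKernel("example_net"); // load the DNNC-generated kernel
    DPUTask *task = dpuCreateTask(kernel, 0);         // create a task to run it

    // ... set the input tensor here, e.g. from an OpenCV image ...

    dpuRunTask(task);                                 // run the network on the DPU

    // ... read back the output tensors and post-process on the CPU ...

    dpuDestroyTask(task);                             // release the task
    dpuDestroyKernel(kernel);                         // release the kernel
    dpuClose();                                       // detach from the DPU driver
    return 0;
}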
To perform these five steps, DNNDK provides several different tools, which are split between the host and the target.
On the host side, we are offered the following tools:
- DECENT — Deep compression tool, performs the compression of the network model (a conceptual sketch of the INT8 quantization involved is shown after these tool lists).
- DNNC — Deep neural network compiler, performs the network compilation. DNNC has a sub-component, DNNAS (the deep neural network assembler), which generates the ELF files for the DPU.
While on the target side:
- N2Cube — This is the DPU run time engine and provides loading of DNNDK applications, scheduling, and resource allocation. Core components of the N2Cube include the DPU driver, DPU loader, and DPU tracer.
- DExplorer — Provides DPU info during run time.
- DSight — Profiling tool that provides visualization based on the DPU tracer information.
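To make the idea of INT8 quantization a little more concrete, the sketch below shows one simple way a floating-point tensor can be mapped to 8-bit integers using a single scale factor. This is purely a conceptual illustration, not DECENT's actual algorithm; the quantize function and its scaling scheme are my own assumptions for demonstration.

// Conceptual illustration of INT8 quantization with a single scale factor.
// This is NOT DECENT's actual algorithm, just a simple demonstration of the idea.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int8_t> quantize(const std::vector<float> &values, float &scale) {
    // Choose a scale so the largest magnitude maps onto the INT8 range [-127, 127]
    float max_abs = 0.0f;
    for (float v : values) max_abs = std::max(max_abs, std::fabs(v));
    scale = (max_abs > 0.0f) ? (127.0f / max_abs) : 1.0f;

    std::vector<int8_t> quantized;
    quantized.reserve(values.size());
    for (float v : values) {
        // Round to the nearest integer and clamp into range
        long q = std::lround(v * scale);
        q = std::max(-127L, std::min(127L, q));
        quantized.push_back(static_cast<int8_t>(q));
    }
    return quantized; // recover approximate values later with: value = q / scale
}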
In this project we are going to take a look at how we can get the examples up and running on both the Ultra96 and the ZCU104 board.
Once we have these up and running, we will make some simple modifications to one of the examples so that we can see it running from a live video feed.
Setting Up
Before we can see our first examples up and running, we need to download the DNNDK from the Xilinx website.
Once we have the DNNDK downloaded, the next step is to extract the compressed file so we can see its contents.
In the unzipped directory you will notice the following directories:
- Common - Contains a number of images for classification
- Host_x86 - The host development tools, e.g. DECENT and DNNC
- DP-8020 / DP-N1 / ZCU102 / ZCU104 / Ultra96 - These are the tool chains and example applications for the named reference boards.
To set up the boards, we first need to download the Linux images for the boards of interest from the Xilinx Deephi website.
Once these are downloaded, we can uncompress the image and write it to an SD card for either the Ultra96 or the ZCU104 (make sure you use the correct image for the board).
To do this, I used a program such as Disk Imager or Etcher.
Once the board has booted the image, we will also need to be able to transfer the downloaded DNNDK examples and tools to the running board.
To do this we will be using MobaXterm, which allows us to transfer files to the target board when it is connected to a network.
The next thing to do is set up the boards. Both processes are very similar, with a slight adaptation for the Ultra96 and its WiFi connection.
Setting Up the ZCU104
The first board we will examine is the ZCU104. Once the SD card with the image is ready, double check the boot mode switches on the ZCU104 and power on the board.
Ensure the ZCU104 is connected to an Ethernet connection, and you will see the boot process complete after a few seconds.
If at any point you are asked for a username and password, they are both root.
Once the Linux image is up and running, the first thing to do is determine the IP address assigned. We can do this by entering the command:
ifconfig
Now the board is up and running and connected to the network, we can use MobaXterm to transfer the tools and examples for the ZCU104.
In MobaXterm, enter the command:
scp -r /ZCU104 root@<ZCU104 IP address>:~/
This will transfer the files from the development machine to the ZCU104.
We can then install the DNNDK and its examples using either an SSH or serial terminal.
Using a terminal, you will notice a new directory has been installed called ZCU104. Under here you will see an install script; run the script using the command:
./install.sh
This will start the installation process, as shown below.
As part of the installation process, you will see the Linux image restart.
We can now work with the example applications which are provided for that particular board.
However, for each example we first need to compile it, which is achieved by running make in the directory of the example you wish to see.
We set up the Ultra96 in a similar manner to the ZCU104; however, we need to make a few adaptations for the WiFi.
Once the initial Ultra96 image has booted, to be able to install the DNNDK tools and applications we first need to connect to the WiFi.
To do this, we need to use a USB hub which allows us to connect a mouse and keyboard, and connect the Mini DisplayPort output to a suitable monitor.
Power on the Ultra96 and you will see it boot to a desktop environment.
To connect to the WiFi, click on the menu and select Internet -> Wicd Network Manager. This will open the network manager, which lists all of the available networks.
Select the one you want and click Connect. If the network is encrypted, it will issue a warning and ask for the password.
Once the password is entered, click OK and then Connect. Connecting to a network might take a few seconds.
We can again use MobaXterm to upload the Ultra96 directory, and then install the tools and applications on the board as we did for the ZCU104:
scp -r /Ultra96 root@<Ultra96 IP address>:~/
Again, we install the Ultra96 tools and applications by running the command:
./install.sh
We are now ready to start working with our boards. For the rest of this project we will focus on the Ultra96; however, the process for the ZCU104 is identical.
Building and Running Examples
We can run the DNNDK examples on the Ultra96 using the mouse, keyboard, and a terminal window on the Ultra96 desktop, which proves very useful.
As mentioned above, to run each example we first need to build it by running make in the application directory.
However, there are other tools which come with the DNNDK that we may want to make use of.
We can use dexplorer to obtain information on the DPU cores included in the PL. Entering the command below will list the DPU configuration:
dexplorer -w
This command will show the type of DPU implemented in the PL; we can see this below for the ZCU104 and Ultra96 implementations.
We can also use the status option, which will list the status of the DPUs:
dexplorer -s
Another interesting option is profiling mode, which will generate profiling information.
To start profiling, we run the command:
dexplorer -m profile
Once the application has completed, we can convert the profile information into HTML using the command below. Each trace file name will include the PID of the process that generated it.
dsight -p dpu_trace_[PID].prof
When we run an application example, we will also capture the profile information.
Running the ADAS Example
Let's take an example and run the Ultra96 ADAS example. To run this, we need to use the following commands:
cd Ultra96/samples/adas_detection
make
dexplorer -m profile
./adas_detection video/adas.avi
This will run the application, offloading elements of the network to the DPU in the programmable logic.
When we run this application, a video window will appear on the desktop and you will see the video running with cars detected, identified, and tracked.
Once the application has completed, we can view the profile information by converting it to HTML using the command:
dsight -p dpu_trace_[PID].prof
Running this on the profile information captured when running the ADAS implementation results in the graph below.
With the applications and tools all running on our boards, we can start to develop our own applications. For this we have two choices: train new network weights using TensorFlow / Caffe, DECENT, and DNNC, or adapt one of the examples.
To wrap this project up, that is exactly what we are going to do.
We are going to adapt the pose_detection example to run from a live stream rather than a video file. This is a pretty simple modification to make, and we can do so with the changes below.
Update the declaration of the video capture as below:
VideoCapture cap(0);
Modify the main function as below:
int main(int argc, char **argv) {
    // Attach to DPU driver and prepare for running
    dpuOpen();

    if (!cap.isOpened()) {
        return -1;
    }
    cap.set(CV_CAP_PROP_FRAME_WIDTH, 640);
    cap.set(CV_CAP_PROP_FRAME_HEIGHT, 320);

    // Run tasks for SSD
    array<thread, 4> threads = {thread(Read, ref(is_reading)),
                                thread(runGestureDetect, ref(is_running_1)),
                                thread(runGestureDetect, ref(is_running_1)),
                                thread(Display, ref(is_displaying))};
    for (int i = 0; i < 4; ++i) {
        threads[i].join();
    }

    // Detach from DPU driver and release resources
    dpuClose();
    cap.release();
    return 0;
}
Modify the Read function as below:
void Read(bool &is_reading) {
    while (is_reading) {
        Mat img;
        if (read_queue.size() < 30) {
            if (!cap.read(img)) {
                cout << "Finish reading the video." << endl;
                is_reading = false;
                break;
            }
            mtx_read_queue.lock();
            read_queue.push(make_pair(read_index++, img));
            mtx_read_queue.unlock();
        } else {
            usleep(20);
        }
    }
}
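For reference, the modified Read function relies on a handful of globals which are already declared in the pose_detection example; a representative set is sketched below. The names match the snippet above, but the exact types in the shipped example may differ slightly, so treat these as assumptions.

// Representative globals used by the modified Read function. These already exist
// in the pose_detection example; the exact types there may differ slightly.
#include <mutex>
#include <queue>
#include <utility>
#include <opencv2/opencv.hpp>

using namespace cv;
using namespace std;

VideoCapture cap(0);               // live camera rather than a video file
queue<pair<int, Mat>> read_queue;  // frames waiting to be processed
mutex mtx_read_queue;              // protects access to read_queue
int read_index = 0;                // index of the next frame to be read
bool is_reading = true;            // cleared to stop the Read thread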
We can then compile the application again using the make command in the pose_detection directory:
make
dexplorer -m profile
./pose_detection
When I ran this on my Ultra96, with my wife doing a little modelling for the testing, I captured the video below, which shows the algorithm working pretty well, I think.
The final thing to do was to observe the profile information.
This project has hopefully provided a good introduction to the DNNDK and how we can easily get it up and running. We can adapt the existing applications if we so desire.
We will examine training a new network for deployment in another project soon.
References
See previous projects here.
Additional Information on Xilinx FPGA / SoC Development can be found weekly on MicroZed Chronicles.
You can find the files associated with this project here: