AI is a constantly evolving field, and changes in software often occur too quickly for hardware accelerators to catch up. The increasing complexity of Deep Learning algorithms demands huge computational power and flexible hardware. FPGAs can provide a needed boost in performance whilst also adapting to the latest compute-intensive AI techniques. With almost 3X AI performance and 2X performance-per-watt versus competing SOMs, the Kria SOM proved to be an ideal development environment for ML edge applications. The Vision AI starter kit comes with a vision-centric carrier card and pre-built accelerated applications, which can be customized to our needs at different levels of abstraction (from the software application down to the hardware design).
Initial setup
The Kria K26 SOM can be configured with various Deep Learning Processing Unit (DPU) configurations based on performance requirements. The KV260 benchmarks are derived on the DPU B4096, while the KV260 applications are designed on the DPU B3136.
Vitis AI - 1.4.1
Docker image - 1.4
Refer to the detailed guide here on:
- flashing the image
- setting up the board
- running the smart camera application (optional but recommended)
Detecting and monitoring the eye state of a person can be used in various applications from driver assistance systems (drowsiness detection) to health care (determining the fatigue levels of an individual).
The data set used for the training can be found here.
image size - 150x150 (The inputs to the CNN model need to be square for Vitis-AI. If the input isn't square, it must be padded to make it square.)
The binary classification model training flow can be found here. For this particular run, the validation accuracy was 0.912.
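For reference, a minimal Keras model for this kind of binary classifier might look like the sketch below; the layer sizes and filter counts are assumptions for illustration, not the project's actual architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal sketch of a binary eye-state classifier: 150x150 RGB input,
# single sigmoid output. Layer sizes are illustrative assumptions.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # open vs. closed eye
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
model.save('classification_model.h5')
```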
Freeze graph
Create the frozen graph of the trained model (freezing moves the model from the training phase to the inference phase by locking the weights). Pass the trained model and checkpoints to the freeze_graph() function. Alternatively, the Vitis-AI built-in freeze graph function can be used to create the frozen graph.
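A minimal sketch of the freezing step, assuming the TF1-style flow that the Vitis-AI 1.4 tools expect (file names follow the ones used elsewhere in this project):

```python
import tensorflow as tf
from tensorflow.compat.v1.keras import backend as K

tf.compat.v1.disable_eager_execution()

# Load the trained Keras model and freeze its variables into constants.
model = tf.keras.models.load_model('classification_model.h5')
sess = K.get_session()
frozen = tf.compat.v1.graph_util.convert_variables_to_constants(
    sess, sess.graph_def,
    [model.output.op.name])  # output node name of the graph

with tf.io.gfile.GFile('frozen_graph.pb', 'wb') as f:
    f.write(frozen.SerializeToString())
```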
Use TensorBoard/Netron to visualize the frozen model and get the input/output nodes.
> pip install netron
Then, in a Python shell:
> import netron
> netron.start('classification_model.h5')
Verify that freezing the model has not caused any significant variation in accuracy using the evaluate_graph() function.
The accuracy of the frozen model is 0.908.
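The check itself can be as simple as running the frozen graph over the validation set, along the lines of the hedged sketch below (the tensor names are placeholders; use the ones Netron reports, and x_val/y_val are assumed to be loaded already):

```python
import numpy as np
import tensorflow as tf

# Load the frozen graph.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# x_val: (N, 150, 150, 3) images scaled as in training; y_val: 0/1 labels.
with tf.compat.v1.Session() as sess:
    tf.import_graph_def(graph_def, name='')
    x = sess.graph.get_tensor_by_name('conv2d_input:0')     # placeholder name
    y = sess.graph.get_tensor_by_name('dense_1/Sigmoid:0')  # placeholder name
    preds = sess.run(y, feed_dict={x: x_val})
    print('frozen graph accuracy:', np.mean((preds[:, 0] > 0.5) == y_val))
```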
Vitis-AI
Clone the Vitis-AI repository in your project directory and check out the 1.4 version.
git clone https://github.com/Xilinx/Vitis-AI
cd Vitis-AI
git checkout v1.4
The docker setup mounts the current directory as /workspace/ in the container. In the docker environment, you should have the binary_classification directory of this project in your workspace.
<path_to_Vitis-AI>/docker_run.sh xilinx/vitis-ai-cpu:1.4.1.978
Activate the TensorFlow environment
Vitis-AI /workspace > conda activate vitis-ai-tensorflow
Use the setenv.sh script to create the required folders.
(vitis-ai-tensorflow) Vitis-AI /workspace $ ./setenv.sh
Use the Vitis-AI inspect command to identify the input and output nodes.
(vitis-ai-tensorflow) Vitis-AI /workspace $ vai_q_tensorflow inspect --input_frozen_graph=frozen_graph.pb
Quantization
One of the reasons to prefer the Kria SOM over others is its low-precision support. Quantization essentially reduces the number of bits used for our tensors and weights, and hence reduces memory usage. Going from 32-bit float to 8-bit fixed point reduces memory usage four times without much impact on accuracy.
The Vitis-AI quantizer performs several forward passes on our training set and chooses an optimum quantization scheme. The quantizer takes an input function (input_fn.py) to convert the calibration dataset into the input data of the frozen graph during quantize calibration (this usually performs data pre-processing and augmentation). Run the quantize.sh script.
(vitis-ai-tensorflow) Vitis-AI /workspace $ ./quantize.sh
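A typical input_fn.py simply feeds batches of preprocessed calibration images to the input node. A minimal sketch, assuming the image folder path and the input node name (take the real node name from the inspect step):

```python
# input_fn.py - calibration input function for vai_q_tensorflow.
import cv2
import glob

calib_images = sorted(glob.glob('calib_images/*.jpg'))  # assumed path
BATCH_SIZE = 10

def calib_input(iter):
    """Return one batch of preprocessed images for calibration pass `iter`."""
    images = []
    for path in calib_images[iter * BATCH_SIZE:(iter + 1) * BATCH_SIZE]:
        img = cv2.imread(path)
        img = cv2.resize(img, (150, 150))
        img = img / 255.0  # same scaling as used in training
        images.append(img)
    return {'conv2d_input': images}  # placeholder input node name
```

quantize.sh would then point vai_q_tensorflow at this function via --input_fn input_fn.calib_input, together with the input/output node names and --input_shapes ?,150,150,3.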
Check the impact on accuracy. For this run, the accuracy of the quantized model was found to be 0.892.
Compilation
The VAI_C compiler takes in the quantized model, optimizes the data and control flow, and splits the model into kernels that run on either the DPU or the CPU.
The --arch option supplies the specific configuration of the DPU architecture. Create an arch.json file with the target DPU (DPUCZX8G_ISA0_B4096_MAX_BG2 for B4096; the example below targets B3136):
{
"target": "DPUCZX8G_ISA0_B3136_MAX_BG2"
}
The --options parameter provides specific options for the edge or cloud FPGA flows, for dumping debug files, and for selecting debug or normal mode. In debug mode, the DPU nodes are run one at a time, so we can explore the debugging or performance profile of each node. In normal mode, the DPU runs without interruption.
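compile.sh typically wraps a single vai_c_tensorflow call; a sketch of what it might contain (the quantizer output path and net name are assumptions):

```sh
vai_c_tensorflow \
    --frozen_pb  quantize_results/quantize_eval_model.pb \
    --arch       ./arch.json \
    --output_dir ./compile_results \
    --net_name   binary_classification \
    --options    "{'mode': 'normal'}"
```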
(vitis-ai-tensorflow) Vitis-AI /workspace $ ./compile.sh
The compiler outputs the compiled .xmodel, md5sum.txt, and meta.json files.
We can generate a PNG of the compiled model to check the individual layers:
- layers in the DPU subgraph - outlined in blue
- layers deployed on the CPU - outlined in red
(vitis-ai-tensorflow) Vitis-AI /workspace $ xir png binary_classification.xmodel compiled_graph.png
Deploying on target
Create an application to test our model performance on the board. We use the vart runner class to handle the initialization and the communication with the DPU API.
- Create the runner: dpu = vart.Runner.create_runner()
- Get the input/output tensors: dpu.get_input_tensors(), dpu.get_output_tensors()
- Execute the runner: job_id = dpu.execute_async(inputData, outputData)
- Wait for the runner to finish: dpu.wait(job_id)
Set up the threads to run the DPU. (Vary the number of threads to compare performance and find the optimal result.)
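A condensed sketch of such an app (app_mt.py-style), assuming the compiled model from the previous step and a preloaded test_images list; the subgraph lookup and pre-processing details may differ from the project's actual script:

```python
import threading
import numpy as np
import vart
import xir

def run_dpu(runner, images, results, offset):
    """Run one DPU runner over a slice of the test images."""
    out_dims = tuple(runner.get_output_tensors()[0].dims)
    out = np.empty(out_dims, dtype=np.float32)
    for i, img in enumerate(images):
        inp = np.expand_dims(img, 0).astype(np.float32)
        job_id = runner.execute_async([inp], [out])
        runner.wait(job_id)
        results[offset + i] = out.copy()

# Pick the DPU subgraph of the compiled model (CPU subgraphs are skipped).
graph = xir.Graph.deserialize('binary_classification.xmodel')
subgraphs = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
             if s.has_attr('device') and s.get_attr('device').upper() == 'DPU']

n_threads = 2  # vary this to compare throughput
runners = [vart.Runner.create_runner(subgraphs[0], 'run')
           for _ in range(n_threads)]

# test_images: assumed list of preprocessed (150, 150, 3) float arrays.
results = [None] * len(test_images)
chunk = len(test_images) // n_threads
threads = [threading.Thread(target=run_dpu,
                            args=(runners[t],
                                  test_images[t * chunk:(t + 1) * chunk],
                                  results, t * chunk))
           for t in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```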
Copy the test images, compiled model and the app to the board.
The B4096 image already has the DPU configured, so we can run the app directly after copying the required files. On the B3136 image, we load the DPU by loading one of the corresponding pre-built applications.
Install the Vitis AI and OpenCV package groups.
sudo dnf install packagegroup-petalinux-vitisai
sudo dnf install packagegroup-petalinux-opencv
Load the smartcam app to load the associated DPU.
sudo xmutil unloadapp
sudo xmutil loadapp kv260-smartcam
Use the xdputil query or show_dpu command to verify the DPU version.
The vart.conf file needs to be updated to point to the dpu.xclbin of the loaded application:
echo "firmware: /lib/firmware/xilinx/kv260-smartcam/kv260-smartcam.xclbin" | sudo tee /etc/vart.conf
Run app_mt.py with the compiled model. (Append & to the end of the command to run it in the background.)
python3 app_mt.py -m binary_classification.xmodel -t 2 &
Check the performance of the board.
xmutil platformstats -p
Model accuracy (%) at different stages:
| Post-training (Float) | Frozen Graph (Float) | Quantized Model (INT8) | Hardware Model (INT8) |
| --------------------- | -------------------- | ---------------------- | --------------------- |
| 91.2                  | 90.8                 | 89.2                   | 88.4                  |
Comparison between different DPU configurations:
| DPU   | FPS (latency optimized) | Power (W) (latency optimized) | FPS (throughput optimized) | Power (W) (throughput optimized) |
| ----- | ----------------------- | ----------------------------- | -------------------------- | -------------------------------- |
| B3136 | 379.50                  | 7.03                          | 418.55                     | 7.37                             |
| B4096 | 531.43                  | 5.77                          | 608.85                     | 5.92                             |
Latency optimized - executing with 1 thread
Throughput optimized - executing with 2 threads