Introduction
Mobile robotics has advanced rapidly in this decade. Powerful hardware can now be packed into small devices that can be deployed on robots and drones. The issue with current technology is that it is fixed in its capability, consumes a lot of power, and is not very reactive in real-time scenarios. FPGAs can solve this problem: their programmable logic cells can be configured into dedicated hardware that accelerates whatever task needs to run, while also consuming less power.
Problem:
Complex and highly accurate neural networks are used for obstacle avoidance in mobile robots. These robots may face sudden dynamic obstacles in their path, and the current hardware might not deliver a prediction in time to avoid a collision, leading to the robot being damaged. Traditionally, neural networks are trained and stored in 32-bit floating point (FP32). While FP32 models are highly accurate in inference, they may need so much computational power that they cannot run in real time on edge devices. This creates a need for more powerful hardware to run these neural networks, but that increases power requirements and weight, which reduces the payload and range of these robots and drives up the price.
Solution:
To solve this issue, we will use the Xilinx Kria KV260 FPGA board along with the Vitis-AI software. Vitis-AI allows us to optimize our neural network and deploy it onto the KV260. Users do not need any FPGA prerequisites or prior experience to use Vitis-AI.
Vitis-AI lets us quantize our deep learning model and deploy it onto the dedicated DPU (Deep Learning Processing Unit) to hardware accelerate the model.
Let's start with training.
Training:
For training our neural network, we have chosen the ResNet18 classification model, pretrained on the ImageNet dataset. We have chosen PyTorch as our framework for working on the model. We take the pretrained model available on the PyTorch Hub and use transfer learning, which replaces the last layer (the classification layer) with our modified layer. ResNet18's final layer has 1000 output features, but since our model only needs two classes, we remove the final layer and add our own. The picture below shows how the model is downloaded from the torchvision repository and how we replace the final layer with our own.
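In code, that transfer-learning step looks roughly like the following minimal sketch (the two-class output size corresponds to the blocked/free dataset described next):

import torch
import torchvision

# Download the ImageNet-pretrained ResNet18 from the torchvision model zoo
model = torchvision.models.resnet18(pretrained=True)

# Replace the 1000-class ImageNet classifier with a 2-class (blocked/free) layer
model.fc = torch.nn.Linear(model.fc.in_features, 2)

# Move the model to the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)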
Next, we load our image dataset through the PyTorch ImageFolder function. This ensures that the two classes, blocked and free, are created. We then create train and validation datasets by splitting the data with PyTorch's random_split function, and pass them to data loaders to create batches of data. The batch size used here is 8.
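A rough sketch of that data-loading step, assuming the images live in a folder named dataset with blocked and free subfolders and that 50 images are held out for validation (both are assumptions), could be:

import torch
import torchvision
from torchvision import transforms

# Resize to the ResNet18 input size and convert to tensors
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder creates one class per subfolder: blocked and free
dataset = torchvision.datasets.ImageFolder('dataset', transform=transform)

# Split into training and validation sets
test_size = 50  # assumption: number of validation images held out
train_dataset, test_dataset = torch.utils.data.random_split(
    dataset, [len(dataset) - test_size, test_size])

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=8, shuffle=False)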
Next, we create a training loop where we define how the model takes the data and how backpropagation and the optimizer work to update the weights. We also transfer our data and our model to the GPU to make the training process faster.
import torch
import torch.nn.functional as F
import torch.optim as optim

NUM_EPOCHS = 15
BEST_MODEL_PATH = 'best_model_resnet18.pth'
best_accuracy = 0.0

optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for epoch in range(NUM_EPOCHS):

    # Training pass: forward, loss, backpropagation, weight update
    for images, labels in iter(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        optimizer.step()

    # Validation pass: count the misclassified images
    test_error_count = 0.0
    for images, labels in iter(test_loader):
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        test_error_count += float(torch.sum(torch.abs(labels - outputs.argmax(1))))

    test_accuracy = 1.0 - float(test_error_count) / float(len(test_dataset))
    print('%d: %f' % (epoch, test_accuracy))

    # Keep the weights of the best-performing epoch
    if test_accuracy > best_accuracy:
        torch.save(model.state_dict(), BEST_MODEL_PATH)
        best_accuracy = test_accuracy
During training, the model saves its weights at the end of every epoch that improves the validation accuracy. We trained it for 15 epochs, but training can be stopped earlier if the model converges faster.
After training, we can move on to quantization and compilation for deployment on the KV260.
Quantization:
Quantization is the process of converting a floating point model to a model with lower precision. It aims to reduce computational complexity by using lower precision number formats like 16-bit floating point (FP16) and INT8. Lower precision reduces memory bandwidth and computational load, which leads to faster data transfer and lower power usage. Vitis-AI requires the model to be quantized to INT8 format for deployment on compatible devices.
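As a simple illustration of the idea (not the exact Vitis-AI implementation), a float tensor can be mapped to INT8 by dividing by a scale factor derived from its observed value range:

import numpy as np

# Illustrative symmetric INT8 quantization of a small float tensor
values = np.array([0.12, -0.87, 0.45, 1.30], dtype=np.float32)

scale = np.abs(values).max() / 127.0                 # map the observed range onto [-127, 127]
quantized = np.clip(np.round(values / scale), -128, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale   # approximate reconstruction

print(quantized)     # e.g. [ 12 -85  44 127]
print(dequantized)   # close to the original values, with small rounding error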
Inside the Vitis-AI Docker container, once we have activated the PyTorch conda environment, we use torch_quantizer and dump_xmodel, imported from pytorch_nndct.apis, for quantization. Vitis-AI requires a calibration dataset for quantization: unlike the weights and biases, which are constant tensors, the activations and the model inputs and outputs span a wide range of values, and we need to capture their min/max ranges for full integer quantization. The calibration dataset need not be large; around 100-1000 images are enough.
It is recommended to run the quantization process on a GPU to reduce the time required, although running it on a CPU will also work.
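The quantization.py script invoked below is not shown in full, but a minimal sketch of how it could be structured around torch_quantizer and dump_xmodel (folder names such as dataset and build/quant_model are assumptions, chosen to match the compile command later) is:

import argparse
import torch
import torchvision
from torchvision import transforms
from pytorch_nndct.apis import torch_quantizer, dump_xmodel

parser = argparse.ArgumentParser()
parser.add_argument('-quant_mode', default='calib', choices=['calib', 'test'])
parser.add_argument('-b', type=int, default=16, help='batch size')
args = parser.parse_args()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Rebuild the trained model and load the float weights saved during training
model = torchvision.models.resnet18()
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model.load_state_dict(torch.load('best_model_resnet18.pth', map_location='cpu'))
model = model.to(device).eval()

# Create the quantizer from a dummy input of the deployment shape
dummy_input = torch.randn(args.b, 3, 224, 224)
quantizer = torch_quantizer(args.quant_mode, model, (dummy_input,), output_dir='build/quant_model')
quant_model = quantizer.quant_model

# Forward the calibration images so the min/max activation ranges are recorded
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
calib_dataset = torchvision.datasets.ImageFolder('dataset', transform=transform)  # assumed path
calib_loader = torch.utils.data.DataLoader(calib_dataset, batch_size=args.b, shuffle=False)

with torch.no_grad():
    for images, _ in calib_loader:
        quant_model(images.to(device))

if args.quant_mode == 'calib':
    # Save the computed quantization parameters
    quantizer.export_quant_config()
else:
    # Export the ResNet_int.xmodel used later by the compiler
    dump_xmodel('build/quant_model', deploy_check=False)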
To proceed with quantization, execute the following command:
python quantization.py -quant_mode calib -b 16
This ensures that all the files necessary for quantization and the parameters used are saved.
Then execute the command below:
python quantization.py -quant_mode test
This will generate a ResNet_int.xmodel file, which is later used to compile the model for the DPU. The accuracy printed at the end of the terminal output in this step is the accuracy of the quantized model.
Compilation:
After finishing the quantization step and before moving the model to the board, we have to compile the model for the board's respective DPU.
To compile the model, execute the command below on the command line:
vai_c_xir -x build/quant_model/ResNet_int.xmodel -a /opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json -o build/quant_model/ -n collision_avoidance
This will generate the final xmodel file required for running the inference code on the Kria.
Inference:
After the quantization and compilation steps have been completed, we are ready to run the model on the FPGA. In the inference code, we take the image captured by the robot as input, resize it to 224 x 224, and multiply it by the input scaling value obtained from the xmodel file to convert the image to INT8 format. We also set our output buffer to INT8 format. We then pass the image to the model, which gives us the output. We can use np.argmax on the raw output and so avoid applying the output scale value.
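A rough sketch of such an inference routine using the VART Python API (the model path, preprocessing, and helper name are assumptions and should match your own setup) could be:

import numpy as np
import cv2
import xir
import vart

# Load the compiled model and create a DPU runner (model path is an assumption)
graph = xir.Graph.deserialize('collision_avoidance.xmodel')
subgraphs = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
             if s.has_attr('device') and s.get_attr('device').upper() == 'DPU']
runner = vart.Runner.create_runner(subgraphs[0], 'run')

input_tensor = runner.get_input_tensors()[0]
output_tensor = runner.get_output_tensors()[0]
input_scale = 2 ** input_tensor.get_attr('fix_point')   # scale used to convert float to INT8

def classify(frame):
    # Resize to 224x224, normalize as during training, scale, and cast to INT8
    img = cv2.resize(frame, (224, 224)).astype(np.float32) / 255.0
    img = (img * input_scale).astype(np.int8)
    input_data = np.expand_dims(img, axis=0)
    output_data = np.empty(tuple(output_tensor.dims), dtype=np.int8)

    # Run one frame through the DPU
    job_id = runner.execute_async([input_data], [output_data])
    runner.wait(job_id)

    # argmax over the raw INT8 logits; the output scale does not change the result
    return np.argmax(output_data[0])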
The screenshot below shows the inference speed on the KV260.
Conclusion:
From the results above, we can see that the KV260 takes only 0.0765 s per frame. It also consumes less power than other devices in its category, and the hardware logic can be customized to accelerate any neural network.
A big thanks to Xilinx and the entire technical team for helping me make my idea a reality.