My daughter has started learning addition, subtraction, multiplication and division in math class. Unfortunately, these calculation drills seem boring to kids. She complains that math is not a fun class, and she keeps saying she wants a little smart robot to study and play games with her; only then would she spend more time studying. So I decided to build a "kids math study AI mate".
I am building this AI mate to inspire kids' curiosity and interest in studying math. The idea is that kids will feel they are playing a game or competing with an AI. The AI mate also records accuracy and calculation time, so kids will keep practicing and trying to break their own records again and again. They will feel like the human hero contesting with an AI, just like playing against AlphaGo.
The AI mate can also empower teachers through the power of AI. Math and science teachers can teach classes with the AI mate and organize math competitions in the classroom. Kids will experience the power of AI first-hand, and with this vivid example they will understand math and science more deeply.
The AI mate can play math games with kids, such as 24 points, magic cube, math calculation puzzles, Sudoku, etc. As the initial phase of this project, the 24 point calculation game is implemented first.
24 point calculation game: play it with a deck of poker cards. Draw 4 cards at random and try to make the number 24 from the four numbers shown (Ace = 1, 2-10 = 2-10, Jack = 11, Queen = 12 and King = 13). You can add, subtract, multiply and divide. Use all four numbers on the cards, but use each number only once; you do not have to use all four operations. For example, with 4, 7, 8 and 8 one solution is (7 - 8 / 8) * 4 = 24. In most cases there are several ways to get 24, but in some rare cases no solution exists. This is a good calculation and brain-training game for kids and adults. Have fun!
The Ultra96-V2 serves as the main controller and AI center.
The PL (programmable logic) part implements a DPU to accelerate the vision detection and recognize the poker card numbers. The PS (processing system) part gets the numbers from the DPU and calculates all possible ways to make 24.
Step 1 – Download the Vitis-AI 1.2 image for Ultra96V2 and flash the SD card
Avnet has provided pre-built images, which can be booted and run directly on the Ultra96V2.
The pre-built images include hardware designs built with Vitis with the following DPU configurations:
- ULTRA96-V2 : 1 x B2304 (low RAM usage, low DSP48 usage), 200MHz/400MHz
The pre-built images include compiled models for the following two distinct configurations:
- B2304_lr : B2304 DPU with low RAM usage
You will need to download the following pre-built SD card image for ULTRA96V2 :
- ULTRA96V2 : http://avnet.me/avnet-ultra96v2-vitis-ai-1.2-image (2020-10-22 - MD5SUM = def057a41d72ee460334435234c4264e)
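Optionally, verify the downloaded file before flashing; the file name below is just an example that follows the pattern of the dd command later in this guide, so substitute whatever you actually downloaded:
$ md5sum Avnet-ULTRA96V2-Vitis-AI-1-2-2020-10-22.img
Compare the printed hash with the MD5SUM listed above and re-download the image if it differs.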
The SD card image contains the hardware design (BOOT.BIN, dpu.xclbin), as well as the petalinux images (boot.scr, image.ub, rootfs.tar.gz). It is provided in image (IMG) format, and contains two partitions:
- BOOT – partition of type FAT (size=400MB)
- ROOTFS – partition of type EXT4
The first BOOT partition was created with a size of 400MB, and contains the following files:
- BOOT.BIN
- boot.scr
- image.ub
- init.sh
- platform_desc.txt
- dpu.xclbin
- ULTRA96V2.hwh
The second ROOTFS partition contains the rootfs.tar.gz content, and is pre-installed with the Vitis-AI runtime packages, as well as the following directories:
- /home/root/dpu_sw_optimize
- /home/root/Vitis-AI, which includes pre-built DNNDK samples, pre-built VART samples, and pre-built Vitis-AI-Library samples
1. Program the board specific SD card image to a 16GB micro SD card (preferred) or 8GB micro SD card
a. On a Windows machine, use Balena Etcher or Win32DiskImager (free opensource software)
Notes: the Etcher flash process goes through two steps:
1. flash the image;
2. auto check.
The whole process takes about 20 minutes. It is possible that the auto check step reports an error and fails. Don't worry! Format the SD card and try again!
b. On a linux machine, use the dd utility
$ sudo dd bs=4M if=Avnet-{platform}-Vitis-AI-1-2-{date}.img of=/dev/sd{X} status=progress conv=fsync
Where {X} is a lower-case letter that specifies the device of your SD card. You can use "df -h" to determine which device corresponds to your SD card.
Step 2 – Boot up Ultra96V2 and Execute the AI demo/sample
Insert the SD card that was created in the previous section into the "microSD Socket (J2)".
Connect the power supply to the "12V Power Input (J10)".
Press the "Power Button(SW4)".
You will see the LEDs on the board light up and start flashing. Great! The board is working now!
Some of the configuration steps only need to be performed once (after the first boot), including the following:
1. After boot, launch the dpu_sw_optimize.sh script
$ cd ~/dpu_sw_optimize/zynqmp
$ source ./zynqmp_dpu_optimize.sh
This script will perform the following steps:
- Auto resize SD card’s second (EXT4) partition
- QoS configuration for DDR memory
2. [Optional] Disable the dmesg verbose output:
$ dmesg -D
This can be re-enabled with the following:
$ dmesg -E
3. Validate the Vitis-AI runtime with the dexplorer utility.
For the ULTRA96V2, this should correspond to the following output:
$ dexplorer --whoami
[DPU IP Spec]
IP Timestamp : 2020-06-18 12:00:00
DPU Core Count : 1
[DPU Core Configuration List]
DPU Core : #0
DPU Enabled : Yes
DPU Arch : B2304
DPU Target Version : v1.4.1
DPU Frequency : 300 MHz
Ram Usage : Low
DepthwiseConv : Enabled
DepthwiseConv+Relu6 : Enabled
Conv+Leakyrelu : Enabled
Conv+Relu6 : Enabled
Channel Augmentation : Enabled
Average Pool : Enabled
4. Define the DISPLAY environment variable
$ export DISPLAY=:0.0
5. Change the resolution of the DP monitor to a lower resolution, such as 640x480
$ xrandr --output DP-1 --mode 640x480
6. Launch the VART based sample applications
Launch the adas_detection application
$ cd ~/Vitis-AI/VART/samples/adas_detection
$ ./adas_detection ./video/adas.avi ./model_dir_for_B2304_lr/ yolov3_adas_pruned_0_9.elf
Congratulations! The Ultra96V2 hardware works well now. Ready to deploy your design! Let's go to the design part.
Step 3 – Neural Network Model Selection and Training with Poker images
3.1 Neural Network Model Selection
There are various kinds of NN models for object detection, such as Faster RCNN, ssd_mobilenet, Yolo, etc.
The TensorFlow official GitHub repo provides object detection models at https://github.com/tensorflow/models
It is pretty easy to train and deploy these models on a PC platform. Unfortunately, these models do not seem suitable for deployment on the Ultra96V2 board.
The DPU in the PL can only implement the pure neural network graph part. The pre-processing and post-processing parts are implemented on the Arm cores in the PS.
The TensorFlow official models integrate pre-processing, the graph and post-processing into a single model.
I have tried faster-rcnn and ssd_mobilenet. Both NN models failed to be quantized by Vitis AI, and I haven't found a feasible way to modify them. So I had to give up on the TensorFlow official models.
Darknet-Yolo provides more flexibility to modify the graph.
Luckily, Xilinx provides a modified yolov4 model, adapted for compatibility with the DPU.
Great! I finally selected this feasible object detection model.
3.2 Gather and label pictures
Now that the Yolo object detection model is set up and ready to go, we need to provide the poker images it will use to train a new detection classifier.
Yolo needs hundreds of images of an object to train a good detection classifier. To train a robust classifier, the training images should have random objects in the image along with the desired objects, and should have a variety of backgrounds and lighting conditions. There should be some images where the desired object is partially obscured, overlapped with something else, or only halfway in the picture.
To simplify my poker detection classifier, I only take six different objects I want to detect (the card ranks nine, ten, jack, queen, king, and ace – I am not trying to detect suit, just rank). I plan to support all 13 ranks in the near future. I used my cellphone to take pictures of a card on its own and also multiple cards in one picture. Since I want to be able to detect the cards when they're overlapping, I made sure to have the cards overlap in many images.
With all the pictures gathered, it’s time to label the desired objects in every picture. LabelImg is a great tool for labeling images, and its GitHub page has very clear instructions on how to install and use it.
Download and install LabelImg, point it to your \obj directory, and then draw a box around each object in each image. Note: save in YOLO format!
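For reference, LabelImg in YOLO mode writes one .txt file per image (plus a classes.txt listing the class names), with one line per object in the format
<class_id> <x_center> <y_center> <width> <height>
where all coordinates are normalized to 0-1 relative to the image width and height. For example, a queen roughly centered in the frame might be labeled as follows (the class index and the numbers are purely illustrative):
3 0.512 0.430 0.182 0.265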
3.3 Train the Model
Run the training on Colab.
Pre-step: clone darknet and configure TensorFlow.
%%capture
!git clone https://github.com/AlexeyAB/darknet.git
%cd darknet
import re
!sed -i 's/OPENCV=0/OPENCV=1/' Makefile
!sed -i 's/GPU=0/GPU=1/' Makefile
!sed -i 's/CUDNN=0/CUDNN=1/' Makefile
# !sed -i 's/CUDNN_HALF=0/CUDNN_HALF=1/' Makefile
!make
!chmod +x ./darknet
%tensorflow_version 1.x
import tensorflow as tf
tf.__version__
First, extract the pre-trained model weights.
%cd ./dk_model
!7z x yolov4-leaky_best.weights.7z.001
Second, change the labels in obj.names to our current labels (one class name per line: nine, ten, jack, queen, king, ace).
Third, modify the paths in obj.data (an example follows the note below).
Notes: use absolute paths, otherwise training will report "Error in load_data_detection( ) - OpenCV".
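As a hedged example only (the exact paths depend on where darknet was cloned and where the dataset files live), obj.data could look like this:
classes = 6
train = /content/darknet/dk_files/train_list.txt
valid = /content/darknet/dk_files/valid_list.txt
names = /content/darknet/dk_files/obj.names
backup = /content/darknet/backup/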
Fourth, generate train_list.txt and valid_list.txt (a sketch is shown below).
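There is no single required way to generate these lists; here is a minimal Python sketch, assuming the labeled images live in /content/darknet/data/obj (adjust the directory and the split ratio to your setup):
import glob
import os
import random

img_dir = '/content/darknet/data/obj'   # assumed location of the labeled poker images
imgs = sorted(glob.glob(os.path.join(img_dir, '*.jpg')))
random.seed(0)
random.shuffle(imgs)

# 90% of the images go to the training list, the rest to the validation list
split = int(0.9 * len(imgs))
with open('train_list.txt', 'w') as f:
    f.write('\n'.join(imgs[:split]) + '\n')
with open('valid_list.txt', 'w') as f:
    f.write('\n'.join(imgs[split:]) + '\n')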
Fifth, modify the yolov4 cfg file.
There are several parameters of importance here; they control various aspects of the training process. Let's print the first few lines and have a look.
!head -n 24 /content/yolotinyv3_medmask_demo/yolov3-tiny_obj.cfg
Let's go over the parameters above:
- The batch parameter dictates the batch size. That one generally remains at 64.
- The subdivisions parameter dictates how each batch of 64 images is split into smaller mini-batches that are loaded onto the GPU one at a time. A smaller number means larger mini-batches and faster training, at the cost of more GPU memory. We will use 12. If a CUDA out-of-memory error is triggered, increase subdivisions to e.g. 16, 24 or 32 (64 is the maximum). Unless training at a resolution higher than 416, 12 should be fine.
- The width and height default to 416. Other resolutions to try are 320 or 608.
- The next parameter we care about is max_batches. This determines how long the training process runs. Its value should be at least around 2000 for every class used, so for 3 classes, at least 6000.
- The steps are calculated as a function of max_batches: the first value is 0.8 * max_batches and the second 0.9 * max_batches. In this case, with max_batches = 8000, that is 8000 * 0.8 = 6400 and 8000 * 0.9 = 7200.
For more details about the Yolo parameters, have a look here: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects. There is a lot of material there, so make sure you scroll through if you have questions; pretty much everything is well explained.
Let's modify the parameters according to our class number (a scripted example follows this list):
a) change max_batches and steps
b) change the three classes=80 lines to classes=6
c) change the three filters lines accordingly:
num_filters = (num_classes + 5) * 3
(6+5)*3 = 33
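As a sketch only (the cfg file name and the stock default values on the left-hand side of each sed are assumptions; the max_batches/steps values follow the 8000 example above), the edits can be scripted in a Colab cell like this:
# assumed cfg file name; replace it with the actual cfg used for this project
!sed -i 's/classes=80/classes=6/g' yolov4-leaky_poker.cfg
!sed -i 's/filters=255/filters=33/g' yolov4-leaky_poker.cfg                  # (6+5)*3 = 33, the three filters lines before each [yolo] layer
!sed -i 's/max_batches = 500500/max_batches = 8000/' yolov4-leaky_poker.cfg
!sed -i 's/steps=400000,450000/steps=6400,7200/' yolov4-leaky_poker.cfg      # 0.8 and 0.9 of max_batches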
Everything is ready!
Let's start the training !
%cd ./dk_files
!source ./train_yolov4_colab.sh
While we wait, let's take a closer look at the darknet training command itself.
Important notes:
You must add "-clear", otherwise the training will NOT start, because the iteration counter stored in the pre-trained weights already exceeds max_batches.
You must add "-dont_show", otherwise training will abort, because the training progress chart can't be shown on Colab.
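For reference, the training command inside train_yolov4_colab.sh will look roughly like the line below; the data, cfg and weights file names are assumptions here, only the flags are the point:
./darknet detector train obj.data yolov4-leaky_poker.cfg yolov4-leaky_best.weights -dont_show -clear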
Great! The training is running now and will probably last several hours. Let's wait for it to complete. Note that Colab automatically clears the runtime after a disconnect, so keep an eye on the training.
Step 4 – Darknet Model Conversion
After the training completes, a newly trained model is generated. The next step is to convert the darknet model to a frozen TensorFlow graph.
Copy the best weights to Google Drive:
!cp /content/darknet/backup/yolov4-leaky_poker_best.weights '/content/drive/My Drive/'
The keras-YOLOv3-model-set repository provides some helpful scripts for this.
david8862 Keras Model Set: https://github.com/david8862/keras-YOLOv3-model-set
git clone https://github.com/david8862/keras-YOLOv3-model-set
In order to convert the model that is placed under dk_model, you can simply cd to the scripts directory and run 'convert_yolov4.sh'. This will create the keras .h5 model as well as the converted frozen TF graph under the tf_model folder.
source ./convert_yolov4.sh
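Under the hood, the script performs a two-stage conversion using the repository's converter tools, roughly like the sketch below (the script names and paths inside keras-YOLOv3-model-set may differ between versions, so treat this as an outline rather than the authoritative commands):
# darknet cfg + weights -> keras .h5
python tools/model_converter/convert.py yolov4-leaky_poker.cfg yolov4-leaky_poker_best.weights ../tf_model/yolov4_poker.h5
# keras .h5 -> frozen TensorFlow .pb
python tools/model_converter/keras_to_tensorflow.py --input_model ../tf_model/yolov4_poker.h5 --output_model ../tf_model/yolov4_poker.pb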
Step 5 – Model Quantization/Compilation
The generated TensorFlow pb file must be processed by Vitis-AI to generate the deployable elf file.
5.1 Quantization.
Copy the pb file into the "tf_model" directory and rename it to "tf_model.pb".
Copy part of the training images into the "yolov4_images" folder. These images will be used for calibration during quantization.
Modify yolov4_graph_input_keras_fn.py in the "scripts" directory and change the calib_image related paths if needed.
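The quantizer calls this Python input function to fetch calibration batches. A minimal sketch of what such an input function typically looks like is shown below; the input node name "image_input", the 416x416 input size and the normalization are assumptions that must match the actual graph:
import os
import cv2
import numpy as np

calib_image_dir = '../yolov4_images'            # calibration images copied in the step above
calib_batch_size = 10
image_list = sorted(os.listdir(calib_image_dir))

def calib_input(iter):
    """Return one batch of preprocessed calibration images for the given iteration."""
    batch = []
    for i in range(calib_batch_size):
        name = image_list[(iter * calib_batch_size + i) % len(image_list)]
        img = cv2.imread(os.path.join(calib_image_dir, name))
        img = cv2.resize(img, (416, 416))
        img = img[:, :, ::-1] / 255.0           # BGR -> RGB, scale to 0..1
        batch.append(img)
    return {'image_input': np.array(batch, dtype=np.float32)}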
Change directory to the scripts folder and run ./quantize_yolov4.sh. This will produce a quantized graph in the yolov4_quantized directory, one level above the scripts directory.
cd /scripts
source ./quantize_yolov4.sh
5.2 Compilation
Make sure the ULTRA96V2 hardware handoff files exist in the scripts folder. These files are needed to compile for the Ultra96V2 target.
If they do not exist, refer to the following steps a) through g) to generate them.
The ULTRA96V2.hwh file is stored in the BOOT partition of the SD card image. Copy the hwh file from the SD card and upload it to the docker environment.
a) Create a working directory called "avnet" (or another name), and copy into it the hardware handoff file (.hwh) for the platform you wish to compile the models for
$ mkdir avnet
$ cd avnet
$ cp /media/ULTRA96V2.hwh .
b) Launch the tools docker from the Vitis-AI directory
$ cd $VITIS_AI_HOME
$ sh -x docker_run.sh xilinx/vitis-ai:latest-cpu
c) Within the docker session, launch the "vitis-ai-caffe" Conda environment
$ conda activate vitis-ai-caffe
(vitis-ai-caffe) $
d) Navigate to the working directory we created earlier
$ cd AI-Model-Zoo/avnet
e) Use the dlet tool to generate your .dcf file
(vitis-ai-caffe) $ dlet -f ULTRA96V2.hwh
f) The previous step will generate a dcf file with a name similar to dpu-06-18-2020-12-00.dcf. Rename this file to ULTRA96V2.dcf
(vitis-ai-caffe) $ mv dpu*.dcf ULTRA96V2.dcf
g) Create a file named “ULTRA96V2.json” with the following content
{"target": "DPUCZDX8G", "dcf": "./{platform}.dcf", "cpu_arch": "arm64"}
Before running the compilation, check the compile command: the ARCH parameter should point to "./ULTRA96V2.json", since the handoff files have been copied into the scripts folder.
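For reference, the compile command inside compile_yolov4.sh will be a vai_c_tensorflow invocation along these lines (the input pb path, output directory and net name are assumptions):
vai_c_tensorflow \
    --frozen_pb ../yolov4_quantized/deploy_model.pb \
    --arch ./ULTRA96V2.json \
    --output_dir ../yolov4_compiled \
    --net_name yolov4_poker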
Run the compilation command now.
cd /scripts
source ./compile_yolov4.sh
Finally, the DPU deploy elf file is generated :-)
We are very close to the final goal now!
Let's review the Python scripts.
In "game_point24_webcam.py", the program grabs images from the webcam and runs the poker detector on the DPU, then gets the card regions and number results, as follows:
# start the FPS counter
fps = FPS().start()
# loop over the frames from the video stream
while True:
    # capture image from camera
    ret, frame = cam.read()
    # Vitis-AI/DPU based poker detector
    pokers = dpu_poker_detector.process(frame)
    # loop over the detected pokers
    for i, (left, top, right, bottom) in enumerate(pokers):
        # draw a bounding box surrounding the object so we can visualize it
        cv2.rectangle(frame, (left, top), (right, bottom), (0, 255, 0), 2)
The program waits until four poker cards have been detected by the DPU, then starts the 24 point calculation.
# pass the 4 card numbers to the 24 point calculation program and print out the result
if len(data_list) == 4:
    str = Point24(data_list).calculate()
In the "pokerdetect.py", the model is executed on the DPU, then the output results are retrieved from memory.
""" Execute model on DPU """
job_id = dpu.execute_async( inputData, outputData )
dpu.wait(job_id)
""" Retrieve output results """
OutputData0 = outputData[0].reshape(1,output0Size)
bboxes = np.reshape( OutputData0, (-1, 4) )
#
outputData1 = outputData[1].reshape(1,output1Size)
scores = np.reshape( outputData1, (-1, 2))
The bounding box coordinates are relative to each grid position, so they need to be post-processed by adding the absolute coordinates of each grid position to the bounding box results. The following Python code implements this in a vectorized style to keep performance optimal:
""" Get original poker boxes """
gy = np.arange(0,output0Height)
gx = np.arange(0,output0Width)
[x,y] = np.meshgrid(gx,gy)
x = x.ravel()*4
y = y.ravel()*4
bboxes[:,0] = bboxes[:,0] + x
bboxes[:,1] = bboxes[:,1] + y
bboxes[:,2] = bboxes[:,2] + x
bboxes[:,3] = bboxes[:,3] + y
# extract the bounding box for each poker card
for i, poker in enumerate(pokers):
    xmin = max(poker[0] * scale_w, 0)
    ymin = max(poker[1] * scale_h, 0)
    xmax = min(poker[2] * scale_w, imgWidth)
    ymax = min(poker[3] * scale_h, imgHeight)
    pokers[i] = (int(xmin), int(ymin), int(xmax), int(ymax))
return pokers
In the "point24.py", all possible operation and number combination will be checked.
class Point24():
    # define the operator array
    OPERATIONS = ('+', '-', '*', '/')
    # define the expression format strings (each takes 4 numbers and 3 operators)
    FORM_STRS = [
        # one-bracket cases
        '( %s %s %s ) %s %s %s %s',
        '( %s %s %s %s %s ) %s %s',
        '( %s %s %s %s %s %s %s )',
        '%s %s ( %s %s %s ) %s %s',
        '%s %s ( %s %s %s %s %s )',
        '%s %s %s %s ( %s %s %s )',
        # two-bracket cases
        '( %s %s %s ) %s ( %s %s %s )',
        '( ( %s %s %s ) %s %s ) %s %s',
        '( %s %s ( %s %s %s ) ) %s %s',
        '%s %s ( ( %s %s %s ) %s %s )',
        '%s %s ( %s %s ( %s %s %s ) )',
        # three-bracket cases are duplicates, no need
    ]
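The calculate() method itself is not shown above. A minimal sketch of the idea, assuming an eval-based search over the format strings (the attribute name self.numbers and the exact return format are assumptions), could look like this:
# at the top of point24.py
from itertools import permutations, product

# inside the Point24 class
def calculate(self):
    results = []
    # try every distinct ordering of the four card values
    for nums in set(permutations(self.numbers)):
        # try every combination of three operators
        for ops in product(Point24.OPERATIONS, repeat=3):
            # interleave numbers and operators: n0 op0 n1 op1 n2 op2 n3
            args = (nums[0], ops[0], nums[1], ops[1], nums[2], ops[2], nums[3])
            for form in Point24.FORM_STRS:
                expr = form % args
                try:
                    if abs(eval(expr) - 24) < 1e-6:
                        results.append(expr + ' = 24')
                except ZeroDivisionError:
                    continue
    return '\n'.join(results) if results else 'No 24 point solution for these four cards'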
Now, let's run the 24 point game script.
$ python3 game_point24_webcam.py
A few minutes later, the webcam window will show up and the poker card numbers will be recognized. As soon as four cards are placed on the desk, the program reports whether a 24 point solution exists, as well as all the possible solutions.
This project described how to deploy the 24 point game on the Ultra96V2. This is only the initial phase; more games and functions will be added later.