Team Eagle: john wang, chu wang, Charlie XU

Created November 28, 2020 © Apache-2.0

How to apply a new machine learning model on ultra96

We have converted, trained, tested, quantized, compiled and run enet, densenet and yolo models on ultra96 successfully.

Intermediate | Protip | Over 8 days

Things used in this project

Hardware components

Tria Technologies Ultra96-V2

Webcam, Logitech® HD Pro

Converter, USB 3.0 to Gigabit Ethernet

Software apps and online services

AMD PYNQ Framework

Story

FPGAs offer low latency and high compute throughput, but the FPGA DPU accepts only models in its own format built from supported layer types, so a machine learning model cannot be deployed on it directly. This limitation holds back FPGA-based AI applications. This project shows how to convert a new machine learning model, then train, quantize and compile it, and finally run it on the Ultra96, using the enet, densenet and yolo models as examples. The procedure shows how to apply a new machine learning model on a Xilinx FPGA.

Step One: Modifying training model

Enet model:

Replacing the un-pooling layer with deconvolution layer in the decoder module
Replacing all PReLU with ReLU
Removing spatial dropout layers
Replacing Batchnorm layers with a merged Batchnorm + Scale layer
Position Batchnorm layers in parallel with ReLU

In the UNet-full/UNet-lite models, Batchnorm/Scale layer combinations were inserted before the ReLU layers (after d0c, d1c, d2c, and d3c), because the DPU does not support data flowing from a Convolution to both a Concat and a ReLU simultaneously.

Densenet model:

Replacing Stochastic Gradient Descent (SGD) optimizer with RMSProp optimizer
Using a 3x3 kernel with a stride of 1 and omitting the first max pooling layer, to suit the small (32x32) CIFAR-10 images
Replacing GlobalAveragePooling2D with AveragePooling2D + Flatten
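
The last replacement in the list above works because average pooling with a window that covers the whole feature map, followed by flattening, gives the same numbers as global average pooling. The sketch below illustrates this in plain Python on a single-channel feature map. It is not the project's Keras code, and the helper names are hypothetical.

```python
# Illustrative sketch (not the project's Keras code): plain-Python pooling on a
# single-channel feature map stored as a list of rows.

def global_average_pool(feature_map):
    """Average over every spatial position (what global average pooling does per channel)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def average_pool_2d(feature_map, pool_h, pool_w):
    """Non-overlapping average pooling with a (pool_h, pool_w) window and equal stride."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height - pool_h + 1, pool_h):
        row_out = []
        for left in range(0, width - pool_w + 1, pool_w):
            window = [feature_map[r][c]
                      for r in range(top, top + pool_h)
                      for c in range(left, left + pool_w)]
            row_out.append(sum(window) / len(window))
        pooled.append(row_out)
    return pooled


def flatten(pooled):
    """Collapse the pooled map into a flat list."""
    return [v for row in pooled for v in row]


feature_map = [[1.0, 2.0, 3.0, 4.0],
               [5.0, 6.0, 7.0, 8.0],
               [9.0, 10.0, 11.0, 12.0],
               [13.0, 14.0, 15.0, 16.0]]

gap = global_average_pool(feature_map)
# Pool window equal to the full 4x4 map, then flatten to a one-element vector.
ap_flat = flatten(average_pool_2d(feature_map, 4, 4))
print(gap, ap_flat)
```

In a real network the same holds per channel, provided the pool size matches the spatial size of the final feature map.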

Step Two: Training

Darknet Project directory

Enet Project directory

Setting Densenet Callback Parameters

chkpt_call=ModelCheckpoint(filepath=keras_hdf5+"epoch.{epoch:03d}.val_acc.{val_acc:.2f}.h5", monitor='val_acc',verbose=1,save_best_only=True)
tb_call=TensorBoard(log_dir=tboard,batch_size=batchsize,update_freq='epoch')
lr_scheduler_call=LearningRateScheduler(schedule=step_decay, verbose=1)
lr_plateau_call=ReduceLROnPlateau(factor=np.sqrt(0.1),cooldown=0,patience=5, min_lr=0.5e-6)
callbacks_list = [tb_call, lr_scheduler_call, lr_plateau_call, chkpt_call]
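
Two details of these callbacks can be tried without TensorFlow: the checkpoint file name template and the step schedule passed to LearningRateScheduler. The sketch below copies the thresholds of the step_decay function from the training script in the Code section, leaving out its test-only scaling line. The base_lr parameter and the checkpoint_name helper are assumptions added for illustration.

```python
# Sketch of the learning-rate schedule and checkpoint naming, in plain Python.

def step_decay(epoch, base_lr=0.001):
    """Piecewise-constant schedule: the base rate is divided at fixed epoch thresholds."""
    lr = base_lr
    if epoch > 150:
        lr /= 1000
    elif epoch > 120:
        lr /= 100
    elif epoch > 80:
        lr /= 10
    elif epoch > 2:
        lr /= 2
    return lr


# Same template as the filepath suffix given to ModelCheckpoint above.
CHECKPOINT_TEMPLATE = "epoch.{epoch:03d}.val_acc.{val_acc:.2f}.h5"


def checkpoint_name(epoch, val_acc):
    """Hypothetical helper: format a checkpoint file name from epoch and accuracy."""
    return CHECKPOINT_TEMPLATE.format(epoch=epoch, val_acc=val_acc)


for epoch in (0, 3, 81, 121, 151):
    print(epoch, step_decay(epoch))

print(checkpoint_name(7, 0.8312))  # epoch.007.val_acc.0.83.h5
```

The zero-padded epoch field matters later: it lets a plain alphabetical sort of the file names put the most recent checkpoint last.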

Saving Checkpoints and Resuming Training

model_path=keras_hdf5
listfile = [i for i in os.listdir(model_path) if i.endswith("h5")]
print("listfile = {}".format(listfile))
# sort the saved models by file name (zero-padded epoch number)
listfile.sort()
# reverse the order so the latest model comes first
listfile=listfile[::-1]
print("listfile = {}".format(listfile))
if len(listfile) != 0:
    # load the latest model and resume from its epoch
    model_path = model_path + listfile[0]
    model = load_model(model_path)
    initial_epoch=int(listfile[0].split(".")[1])
    print("initial_epoch = %d"%int(initial_epoch))
else:
    # no trained model yet: build a new model and start from epoch 0
    model = densenetx(input_shape=(input_height,input_width,input_chan),classes=10,theta=0.5,drop_rate=0.2,k=12,convlayers=[16,16,16])
    initial_epoch = 0
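
The resume logic above can be exercised on its own, without Keras, to check that the file name parsing behaves as intended. The following is a minimal sketch that assumes checkpoints follow the epoch.NNN.val_acc.X.XX.h5 pattern set in the callback. The function name find_latest_checkpoint and the demo file names are hypothetical.

```python
import os
import tempfile


def find_latest_checkpoint(model_dir):
    """Return (path, initial_epoch) for the newest .h5 file, or (None, 0) if none exist."""
    names = sorted(n for n in os.listdir(model_dir) if n.endswith(".h5"))
    if not names:
        return None, 0
    latest = names[-1]  # alphabetical order works because the epoch is zero-padded
    # "epoch.007.val_acc.0.83.h5".split(".") -> ["epoch", "007", "val_acc", "0", "83", "h5"]
    initial_epoch = int(latest.split(".")[1])
    return os.path.join(model_dir, latest), initial_epoch


with tempfile.TemporaryDirectory() as demo_dir:
    # An empty directory means training starts from scratch.
    empty_result = find_latest_checkpoint(demo_dir)

    # Create empty placeholder files that only mimic checkpoint names.
    for name in ("epoch.002.val_acc.0.41.h5",
                 "epoch.010.val_acc.0.78.h5",
                 "epoch.007.val_acc.0.65.h5",
                 "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()

    latest_path, resume_epoch = find_latest_checkpoint(demo_dir)
    latest_name = os.path.basename(latest_path)

print(empty_result, latest_name, resume_epoch)
```

Using os.path.join here avoids depending on a trailing slash in the directory argument. Note that the three-digit padding only keeps the sort order correct up to epoch 999.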

Entering VITIS-AI GPU Image

Starting Training

(vitis-ai-tensorflow) john@john-wang:/workspace/DenseNet3/trainrestore/trainrestore$ source setenv.sh
(vitis-ai-tensorflow) john@john-wang:/workspace/DenseNet3/trainrestore/trainrestore$ ./trainrestore.sh

Note: The first command is source setenv.sh rather than ./setenv.sh, because ./setenv.sh runs in a temporary subshell and the variables it sets are lost when that shell exits. The batch size here is 50; choose it according to your computer's RAM.
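
The reason behind this note is general: a child process gets a copy of its parent's environment, and changes made in the child never travel back. The sketch below shows the same effect from Python, with a child interpreter standing in for the subshell. The variable name DEMO_SETENV_VAR is hypothetical and has nothing to do with the contents of setenv.sh.

```python
import os
import subprocess
import sys

VAR_NAME = "DEMO_SETENV_VAR"  # hypothetical variable, used only for this demonstration
os.environ.pop(VAR_NAME, None)  # make sure the parent starts without it

# The child sets the variable in its own environment and prints it.
child_code = (
    "import os\n"
    f"os.environ['{VAR_NAME}'] = 'set-in-child'\n"
    f"print(os.environ['{VAR_NAME}'])\n"
)
result = subprocess.run([sys.executable, "-c", child_code],
                        capture_output=True, text=True, check=True)

child_value = result.stdout.strip()
parent_value = os.environ.get(VAR_NAME)  # still unset in the parent

print("child saw :", child_value)
print("parent has:", parent_value)
```

Running a script with ./ is the child-process case, while source executes the script's lines in the current shell, which is why the settings remain available to trainrestore.sh.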

Tensorboard Result

john@john-wang:~/Vitis-AI_1.2/DenseNet3/trainrestore/trainrestore/build$ tensorboard --logdir=tb_logs

tb_logs is the directory where the TensorBoard data is saved. The command is run from its parent directory; otherwise, give the full path to tb_logs.

Accuracy:

Loss:

Step Three: Converting Densenet Keras model to TensorFlow model

The Vitis AI tools cannot accept Keras checkpoints directly, so the model must first be converted to a frozen graph compatible with TensorFlow.

Two Steps:

1. Convert HDF5 file to TensorFlow checkpoint.

2. Freeze the TensorFlow checkpoint.

./2_keras2tf.sh

The resulting 'frozen_graph.pb' is written to ./files/build/freeze.

./3_eval_frozen.sh

Step Four: Converting Yolo Darknet model to Caffe model

john@john-wang:~/Vitis-AI_1.2/YOLOv3/example_yolov3$ bash 0_convert.sh
$ python ../yolo_convert.py \
0_model_darknet/yolov3.cfg 0_model_darknet/yolov3.weights \
1_model_caffe/v3.prototxt 1_model_caffe/v3.caffemodel
$ python ../yolo.py \
0_model_darknet/yolov3-tiny.cfg 0_model_darknet/yolov3-tiny.weights \
1_model_caffe/v3-tiny.prototxt 1_model_caffe/v3-tiny.caffemodel

Step Five: Preparing Ultra96 Vitis-AI DPU Files

prj_config file

Ultra96.json

DPU Utilization

Step Six: Compiling Kernel

Densenet Kernel:

BOARD=Ultra96
MODEL_NAME=tf_densenet_imagenet_30_30_7.7G
MODEL_UNZIP=${MODEL_NAME}
# MODEL is the second underscore-separated field of MODEL_NAME (here: densenet)
MODEL=$(echo ${MODEL_NAME} | cut -d'_' -f2)
vai_c_tensorflow \
--frozen_pb ${MODEL_UNZIP}/quantized/deploy_model.pb \
--arch /opt/vitis_ai/compiler/arch/dpuv2/${BOARD}/${BOARD}.json \
--output_dir ./model \
--net_name tf_${MODEL}
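
The full compile script in the Code section derives the framework and the model name from MODEL_NAME with cut -d'_' -f1 and -f2, then picks the conda environment, the compiler and the net name accordingly. The following Python sketch mirrors that selection logic so it can be checked in isolation. The function names are hypothetical, and the sketch only builds the choices; it does not invoke any Vitis AI tool.

```python
def parse_model_name(model_name):
    """Split a model directory name into (framework, model), like cut -d'_' -f1 and -f2."""
    fields = model_name.split("_")
    if len(fields) < 2:
        raise ValueError(f"expected '<framework>_<model>_...', got {model_name!r}")
    return fields[0], fields[1]


# Framework prefix -> (conda environment, compiler), as used in the compile script.
TOOLCHAIN = {
    "cf": ("vitis-ai-caffe", "vai_c_caffe"),
    "tf": ("vitis-ai-tensorflow", "vai_c_tensorflow"),
}


def select_toolchain(model_name):
    """Return (conda_env, compiler, net_name) for a model directory name."""
    framework, model = parse_model_name(model_name)
    if framework not in TOOLCHAIN:
        raise ValueError("currently only caffe and tensorflow are supported")
    conda_env, compiler = TOOLCHAIN[framework]
    # The script passes tf_<model> as net name for TensorFlow and <model> for Caffe.
    net_name = f"tf_{model}" if framework == "tf" else model
    return conda_env, compiler, net_name


print(select_toolchain("tf_densenet_imagenet_30_30_7.7G"))
print(select_toolchain("cf_yolotiny_voc_416_416_11.2G"))
```

This also shows why the model directory must follow the framework_model_... naming pattern: a different prefix would reach the error branch.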

Prepare compilation

$ cd example_yolov3
$ cp 1_model_caffe/v3.caffemodel ./2_model_for_quantize/

Quantizing Yolov3 Kernel

(vitis-ai-caffe) john@john-virtual-machine:/workspace/YOLOv3/example_yolov3$ vai_q_caffe quantize -model 2_model_for_quantize/v3.prototxt -weights 2_model_for_quantize/v3.caffemodel -sigmoided_layers layer81-conv,layer93-conv,layer105-conv -output_dir 3_model_after_quantize

Quantizing Yolov3-tiny Kernel

(vitis-ai-caffe) john@john-virtual-machine:/workspace/YOLOv3/example_yolov3$ vai_q_caffe quantize -model 2_model_for_quantize/v3-tiny.prototxt -weights 2_model_for_quantize/v3-tiny.caffemodel -sigmoided_layers layer15-conv,layer22-conv -output_dir 3_model_after_quantize
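
In both quantize commands, the layers passed to -sigmoided_layers appear to be the convolutions that feed the YOLO detection layers. For a custom network, a list of this kind could be derived from the Darknet .cfg file. The sketch below is one way to do that, under the assumption (taken from the names in the commands above, not verified against the converter) that converted layers are named layer<N>-conv, where N is the zero-based Darknet layer index. The function names and the toy configuration are hypothetical.

```python
def parse_darknet_sections(cfg_text):
    """Return the section names of a Darknet cfg in file order, e.g. ['net', 'convolutional']."""
    sections = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            sections.append(line[1:-1])
    return sections


def conv_layers_before_yolo(cfg_text):
    """List 'layer<N>-conv' for each convolution directly preceding a [yolo] section."""
    # [net] holds hyperparameters and is not counted as a layer.
    layers = [s for s in parse_darknet_sections(cfg_text) if s != "net"]
    names = []
    for index, kind in enumerate(layers):
        if kind == "yolo" and index > 0 and layers[index - 1] == "convolutional":
            names.append(f"layer{index - 1}-conv")
    return names


# Toy configuration with two detection heads (not a real YOLO network).
TOY_CFG = """
[net]
width=416
height=416

[convolutional]
filters=16

[maxpool]
size=2

[convolutional]
filters=255
activation=linear

[yolo]
classes=80

[route]
layers=-3

[convolutional]
filters=255
activation=linear

[yolo]
classes=80
"""

sigmoided = conv_layers_before_yolo(TOY_CFG)
print(",".join(sigmoided))
```

The printed, comma-separated string has the same shape as the argument given to -sigmoided_layers. Check the names against the generated prototxt before relying on them.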
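
Compiling Yolov3 Kernel

BOARD=Ultra96
MODEL_NAME=tf_yolov3_voc_416_416_65.63G
MODEL_UNZIP=${MODEL_NAME}
# MODEL is the second underscore-separated field of MODEL_NAME (here: yolov3)
MODEL=$(echo ${MODEL_NAME} | cut -d'_' -f2)
vai_c_tensorflow \
--frozen_pb ${MODEL_UNZIP}/quantized/deploy_model.pb \
--arch /opt/vitis_ai/compiler/arch/dpuv2/${BOARD}/${BOARD}.json \
--output_dir ./model \
--net_name tf_${MODEL}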

Step Seven: Running on Ultra96

Ultra96 Densenet Project Directory

Ultra96 Yolo Project Directory

Running Densenet Model

Ultra96 Integrated System

Image Target Detection

Yolov3 Target Detection on Spot

Running Enet Model on Ultra96

Debugging on Ultra96

Segmentation Picture 1

Segmentation Picture 2

Segmentation Picture 3

PYNQ Design Flow

https://www.bilibili.com/video/BV1NX4y1u775/

If the video below does not open, please use the link above.

Code

#!/bin/bash

# train, evaluate and save trained keras model
train() {
  python3 trainrestore.py \
    --input_height ${INPUT_HEIGHT} \
    --input_width  ${INPUT_WIDTH} \
    --input_chan   ${INPUT_CHAN} \
    --epochs       ${EPOCHS} \
    --learnrate    ${LEARNRATE} \
    --batchsize    ${BATCHSIZE} \
    --tboard       ${TB_LOG} \
    --keras_hdf5   ${KERAS}/
#    --keras_hdf5   ${KERAS}/${K_MODEL}
}

echo "-----------------------------------------"
echo "TRAINING STARTED"
echo "-----------------------------------------"

#rm -rf ${KERAS}
mkdir -p ${KERAS}
#rm -rf ${TB_LOG}
mkdir -p ${TB_LOG}
train 2>&1 | tee ${LOG}/${TRAIN_LOG}

echo "-----------------------------------------"
echo "TRAINING FINISHED"
echo "-----------------------------------------"

#!/bin/bash

set -e

if [ "$#" -eq 2 ]; then
    BOARD=$1
    MODEL_NAME=$2
        echo "./compile.sh $MODEL_NAME"
#        MODEL_UNZIP=$2 
else
        BOARD=Ultra96
#        MODEL_NAME=cf_yolov3_voc_608_608_65.42G
        MODEL_NAME=cf_yolotiny_voc_416_416_11.2G
#        MODEL_NAME=tf_yolov3_voc_416_416_65.63G
#        MODEL_NAME=cf_yolov3_bdd_288_512_53.7G
#        MODEL_NAME=cf_yolov3_cityscapes_256_512_0.9_5.46G
##        MODEL_NAME=cf_ssdpedestrian_coco_360_640_0.97_5.9G
#        cf_yolov3_voc_416_416_65.42G
#        MODEL_NAME=tf_densenet_imagenet_512_512_7.7G
#       MODEL_NAME=cf_segmentation_imagenet_512_512_7.7G
#        MODEL_NAME=cf_inceptionv4_imagenet_299_299_24.5G_79.58
#	 MODEL_NAME=tf_resnetv1_152_imagenet_224_224_21.83G
#	echo "Error: please provide BOARD and MODEL_NAME as arguments."
	echo "Default: ./compile.sh $MODEL_NAME"
#	exit 1
fi

if [ $BOARD = "Ultra96" ] && [ ! -e dpu.hwh ]; then
	echo "Error: please make sure dpu.hwh is in the working directory."
	exit 1
fi

VAI_VERSION=1.1
#MODEL_ZIP=$(echo ${MODEL_NAME} | sed 's/_[1-9\.]\+G_/_/g').zip
#MODEL_UNZIP=$(echo ${MODEL_NAME} | sed "s/\(.*\)_${VAI_VERSION}\(.*\)/\1\2/")
MODEL_UNZIP=${MODEL_NAME}
echo "MODEL_UNZIP NAME: $MODEL_UNZIP"
MODEL=$(echo $MODEL_NAME | cut -d'_' -f2)
echo "MODEL NAME: $MODEL"
FRAMEWORK=$(echo $MODEL_NAME | cut -d'_' -f1)
echo "FRAMEWORK $FRAMEWORK"
# Activate Vitis AI conda environment
source /etc/profile.d/conda.sh
if [ $FRAMEWORK = 'cf' ]; then
	conda activate vitis-ai-caffe
elif [ $FRAMEWORK = 'tf' ]; then
	conda activate vitis-ai-tensorflow
else
	echo "Error: currently only caffe and tensorflow are supported."
	exit 1
fi

# If custom Ultra96 hwh file is provided, add DPU support
if [ $BOARD = "Ultra96" ]; then
	sudo mkdir -p /opt/vitis_ai/compiler/arch/dpuv2/Ultra96
	sudo cp -f Ultra96.json \
		/opt/vitis_ai/compiler/arch/dpuv2/Ultra96/Ultra96.json
	dlet -f dpu.hwh
        sudo rm -f /opt/vitis_ai/compiler/arch/dpuv2/${BOARD}/*.dcf
	sudo cp *.dcf /opt/vitis_ai/compiler/arch/dpuv2/${BOARD}/${BOARD}.dcf
fi

# ZCU111 and ZCU102 use equivalent DPU configurations
if [ $BOARD = "ZCU111" ]; then
	BOARD=ZCU102
fi

## Download model if it doesn't already exist in workspace
#if [ ! -f $MODEL_ZIP ]; then
#	wget -O ${MODEL_ZIP} \
#	"https://www.xilinx.com/bin/public/openDownload?filename=${MODEL_ZIP}"
#fi
#unzip -o ${MODEL_ZIP}

# Compile the model
if [ $FRAMEWORK = 'cf' ]; then
	vai_c_caffe \
		--prototxt ${MODEL_UNZIP}/quantized/deploy.prototxt \
		--caffemodel ${MODEL_UNZIP}/quantized/deploy.caffemodel \
		--arch /opt/vitis_ai/compiler/arch/dpuv2/${BOARD}/${BOARD}.json \
		--output_dir ./model \
		--net_name ${MODEL}
        sudo cp ./model/dpu_${MODEL}_0.elf ./modelbak

elif [ $FRAMEWORK = 'tf' ]; then
        echo "FRAMEWORK tensorflow"
	vai_c_tensorflow \
		--frozen_pb ${MODEL_UNZIP}/quantized/deploy_model.pb \
		--arch /opt/vitis_ai/compiler/arch/dpuv2/${BOARD}/${BOARD}.json \
		--output_dir ./model \
		--net_name tf_${MODEL}
#        sudo cp ./model/dpu_tf_${MODEL}_0.elf ./modelbak
#        sudo cp ./model/dpu_${MODEL}.elf ./modelbak

else
	echo "Error: currently only caffe and tensorflow are supported."
	exit 1
fi

# data channel order: RGB(0~255) input = input / 255 crop: crop the central region of the image with an area containing 87.5%
# of the original image. resize: 224 * 224 (tf.image.resize_bilinear(image, [height, width], align_corners=False)) input = 2*(input - 0.5)

#1. data channel order: RGB(0~255)
#2. resize: resize the short side to 256, keeping the aspect ratio.
#3. center crop: 224 * 224
#4. input = input / 255
#5. input = 2*(input - 0.5) 

from ctypes import *
import cv2
import numpy as np
from dnndk import n2cube
import os
#import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
#import preprocess
#import queue
import sys

try:
    pyc_libdputils = cdll.LoadLibrary("libn2cube.so")
except Exception:
    print('Load libn2cube.so failed\nPlease install DNNDK first!')

top = 1 
resultname = "image.list.result"
threadPool = ThreadPoolExecutor(max_workers=2,)
#scale = 1
#shortsize = 256

classes = ['airplane','automobile','bird','cat','deer','dog','frog','horse','ship','truck']  
listPredictions = []

#from IPython.display import display
#from PIL import Image

#path = os.path.join(image_folder, listimage[2])
#print("path = %s" % path)
#img = cv2.imread(path)
#display(Image.open(path))

_R_MEAN = 123.68
_G_MEAN = 116.78
_B_MEAN = 103.94

MEANS = [_B_MEAN,_G_MEAN,_R_MEAN]

def BGR2RGB(image):
  # B, G, R = cv2.split(image)
  # image = cv2.merge([R, G, B])
  # image = image[:,:,::-1] 
  image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
  return image

def resize_shortest_edge(image, size):
  H, W = image.shape[:2]
  if H >= W:
    nW = size
    nH = int(float(H)/W * size)
  else:
    nH = size
    nW = int(float(W)/H * size)
  return cv2.resize(image,(nW,nH))

def central_crop(image, crop_height, crop_width):
  image_height = image.shape[0]
  image_width = image.shape[1]
  offset_height = (image_height - crop_height) // 2
  offset_width = (image_width - crop_width) // 2
  return image[offset_height:offset_height + crop_height, offset_width:
               offset_width + crop_width, :]

def normalize(image):
  image = image.astype(np.float32)
  image=image/255.0
  image=image-0.5
  image=image*2.0
  return image

#def preprocess_fn(image_path):
#    '''
#    Image pre-processing.
#    Rearranges from BGR to RGB then normalizes to range 0:1
#    input arg: path of image file
#    return: numpy array
#    '''
#    image = cv2.imread(image_path)
#    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
#    image = image/255.0
#    return image

def preprocess_fn(image):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = image.astype(np.float32)
#    print(f"image.type = {type(image)}")
    image = image/255.0
    return image

#def preprocess_fn(image, crop_height, crop_width):
#    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
##    cv2.imshow("test", image)
##    cv2.waitKey(0)  
#    image = resize_shortest_edge(image, 256)
##    print("shape in = {}".format(image.shape))  
#    image = central_crop(image, crop_height, crop_width)
##    print("shape cr = {}".format(image.shape))
#    image = normalize(image)   
#    return image  

#def dpuSetInputImageWithScale(task, nodeName, image, mean, scale, height, width, channel, shortsize, idx=0):
#    (imageHeight, imageWidth, imageChannel) = image.shape
#    if height == imageHeight and width == imageWidth:
#        newImage = image
#    else:
#        newImage = preprocess_fn(image, height, width)
#    return newImage

def parameter(task, nodeName, idx=0):
    #inputscale =  n2cube.dpuGetInputTensorScale(task, nodeName, idx)
    #print("inputscale = %f"%inputscale)
    channel = n2cube.dpuGetInputTensorChannel(task, nodeName, idx)
    output = (c_float * channel)()
    outputMean = POINTER(c_float)(output)
    pyc_libdputils.loadMean(task, outputMean, channel)
    height = n2cube.dpuGetInputTensorHeight(task, nodeName, idx)
    print("height = %d"%height)
    width = n2cube.dpuGetInputTensorWidth(task, nodeName, idx)
    print("width = %d"%width)    
    for i in range(channel):
        outputMean[i] = float(outputMean[i])
#        print("outputMean[%i] = %f"%(i,outputMean[i]))
    return height, width, channel, outputMean
                 
def predict_label(img, task, inputscale, mean, height, width, inputchannel, shortsize, KERNEL_CONV_INPUT):
#    imageRun = preprocess_fn(img, height, width)
    imageRun = preprocess_fn(img)
#    imageRun = dpuSetInputImageWithScale(task, KERNEL_CONV_INPUT, img, mean, inputscale, height, width, inputchannel, shortsize, idx=0)
#    n2cube.dpuGetInputTensor(task, KERNEL_CONV_INPUT)
    imageRun = imageRun.reshape((imageRun.shape[0]*imageRun.shape[1]*imageRun.shape[2]))
#    input_len = 150528
#    print("imageRUN = {}".format(imageRun))
#    input_len = len(imageRun)
#    print(f"imageRun = {imageRun.shape}")
#    print(f"input_len = {input_len}")    
    return imageRun

def TopK(softmax, imagename, fo, correct, wrong):
    for i in range(top):
         num = np.argmax(softmax)
#         print("softmax = %f" % softmax[num])    
#         argmax = np.argmax((out_q[i]))
         prediction = classes[num]  
#         print(prediction)
#         softmax[num] = 0
#         num -1, should notice
#         num = num -1
#         fo.write(imagename+" "+str(num)+"\n")  
         ground_truth, _ = imagename.split('_')
         fo.write(imagename+' p: '+prediction+' g: '+ground_truth+' : '+str(softmax[num])+'\n')
         if (ground_truth==prediction):
            correct += 1
#            print(f"correct = {correct}")
         else:
            wrong += 1
#            print(f"wrong = {wrong}")
    return correct, wrong   
#sem=threading.BoundedSemaphore(1)
def run_dpu_task(outsize, task, outputchannel, conf, outputscale, listimage, imageRun, KERNEL_CONV_INPUT, KERNEL_FC_OUTPUT): 
    input_len = len(imageRun)
#    print(f"input_len = {input_len}")
    n2cube.dpuSetInputTensorInHWCFP32(task,KERNEL_CONV_INPUT,imageRun,input_len)
    n2cube.dpuRunTask(task)
#    outputtensor = n2cube.dpuGetOutputTensorInHWCFP32(task, KERNEL_FC_OUTPUT, outsize)
#    print(outputtensor)
#    print(outputchannel)
#    print(outputscale)
    softmax = n2cube.dpuRunSoftmax(conf, outputchannel, outsize//outputchannel, outputscale)
#    print(f"softmax = {softmax}")
    return softmax, listimage

def run(image_folder, shortsize, KERNEL_CONV, KERNEL_CONV_INPUT, KERNEL_FC_OUTPUT, inputscale):

    start = time.time()
#    listimage = [i for i in os.listdir(image_folder) if i.endswith("JPEG")]
    listimage = [i for i in os.listdir(image_folder) if i.endswith("jpg")]
    listimage.sort()
#    wordstxt = os.path.join(image_folder, "words.txt")
#    with open(wordstxt, "r") as f:
#        lines = f.readlines()
    fo = open(resultname, "w")
    n2cube.dpuOpen()
    kernel = n2cube.dpuLoadKernel(KERNEL_CONV)
    task = n2cube.dpuCreateTask(kernel, 0)
    height, width, inputchannel, mean = parameter(task, KERNEL_CONV_INPUT)
#    print("mean = %f"%mean[0])
    outsize = n2cube.dpuGetOutputTensorSize(task, KERNEL_FC_OUTPUT)
#    print("size = %d"%size)
    outputchannel = n2cube.dpuGetOutputTensorChannel(task, KERNEL_FC_OUTPUT)
#    print("outputchannel = %d"%outputchannel)
    conf = n2cube.dpuGetOutputTensorAddress(task, KERNEL_FC_OUTPUT)
#    print("conf = {}".format(conf))
#    print("inputscale = %f"%inputscale)
    inputscale = n2cube.dpuGetInputTensorScale(task,KERNEL_CONV_INPUT)
#    print("inputscalenow = %f"%inputscale)
    outputscale = n2cube.dpuGetOutputTensorScale(task, KERNEL_FC_OUTPUT)
#    print("outputscale = %f"%outputscale)  
    imagenumber = len(listimage) 
    print("\nimagenumber = %d\n"%imagenumber)
    softlist = []
#    imagenumber = 1000
    correct = 0
    wrong = 0
    for i in range(imagenumber):
        print(f"i = {i+1}") 
        print(listimage[i]) 
#        path = os.path.join(image_folder, listimage[i])
#        if i % 50 == 0:
#        print("\r", listimage[i], end = "") 
        path = image_folder + listimage[i]
        img = cv2.imread(path)
        imageRun = predict_label(img, task, inputscale, mean, height, width, inputchannel, shortsize, KERNEL_CONV_INPUT)
        input_len = len(imageRun)
#        print(f"input_len = {input_len}")     
#        soft = threadPool.submit(run_dpu_task, outsize, task, outputchannel, conf, outputscale, listimage[i], imageRun, KERNEL_CONV_INPUT, KERNEL_FC_OUTPUT)
#        softlist.append(soft)
#    for future in as_completed(softlist):
#        softmax, listimage = future.result()
        softmax, listimage[i] = run_dpu_task(outsize, task, outputchannel, conf, outputscale, listimage[i], imageRun, KERNEL_CONV_INPUT, KERNEL_FC_OUTPUT)
        correct, wrong = TopK(softmax, listimage[i], fo, correct, wrong)
        print("")

    fo.close()
    accuracy = correct/imagenumber
    print('Correct:',correct,' Wrong:',wrong,' Accuracy:', accuracy)    
    n2cube.dpuDestroyTask(task)
    n2cube.dpuDestroyKernel(kernel)
    n2cube.dpuClose()
    print("")

    end = time.time()
    total_time = end - start 
    print('\nAll processing time: {} seconds.'.format(total_time))
    print('\n{} ms per frame\n'.format(1000*total_time/imagenumber))
   
#threadPool.shutdown(wait=True)          
#criteria = ((Accuracy_top1% - 68.5)/15)*0.4 + (10/latency_ms)*0.6
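
The classification step in the program above can be tried off-board. The sketch below assumes that dpuRunSoftmax conceptually multiplies the raw int8 outputs by the output scale and then applies a standard softmax. That is an assumption about its behavior, not a reimplementation of the DNNDK library. The ground truth is taken from the part of the file name before the underscore, as in TopK above. The function names and sample values are hypothetical.

```python
import math

CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']


def scaled_softmax(raw_int8, scale):
    """Softmax over quantized outputs after applying the tensor's output scale."""
    logits = [v * scale for v in raw_int8]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1(probabilities):
    """Return (class name, probability) of the most likely class."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CLASSES[best], probabilities[best]


def ground_truth_from_name(image_name):
    """'cat_0012.jpg' -> 'cat' (label is the text before the first underscore)."""
    return image_name.split("_")[0]


# Made-up raw outputs: class index 3 has by far the largest value.
raw_outputs = [0, -5, 2, 40, 1, 8, -3, 0, 4, -1]
probs = scaled_softmax(raw_outputs, scale=0.25)
label, confidence = top1(probs)
is_correct = (label == ground_truth_from_name("cat_0012.jpg"))

print(label, round(confidence, 4), is_correct)
```

Counting is_correct over all images and dividing by the number of images gives the accuracy figure printed at the end of run.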

#!/bin/bash

set -e

model=tf_densenet
overlays=overlays_300M2304
cd ${overlays}
#pwd
#aarch64-linux-gnu-gcc -fPIC -shared dpu_${model}_0.elf -o libdpumodel${model}.so
aarch64-linux-gnu-gcc -fPIC -shared dpu_${model}.elf -o libdpumodel${model}.so
#echo "aarch64-linux-gnu-gcc -fPIC -shared dpu_${model}_0.elf -o libdpumodel${model}.so"
echo "aarch64-linux-gnu-gcc -fPIC -shared dpu_${model}.elf -o libdpumodel${model}.so"
cp libdpumodel${model}.so /usr/lib/
ls -l /usr/lib/libdpu*.so
cd ..
pwd
cp ./${overlays}/* /usr/local/lib/python3.6/dist-packages/pynq_dpu/overlays/
#python3 overlay.py

'''
 Copyright 2020 Xilinx Inc.

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
'''

'''
Trains the DenseNetX model on the CIFAR-10 dataset

Author: Mark Harvey
'''
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import os
import sys
import argparse

from datadownload import datadownload

# Silence TensorFlow messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# workaround for TF1.15 bug "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR"
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop, SGD
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, LearningRateScheduler, ReduceLROnPlateau
from tensorflow.keras.preprocessing.image import ImageDataGenerator

from DenseNetX import densenetx

from tensorflow.keras.models import load_model
#from tensorflow.keras.models import get_init_epoch

DIVIDER = '-----------------------------------------'

def train(input_height,input_width,input_chan,batchsize,learnrate,epochs,keras_hdf5,tboard):

    def step_decay(epoch):
        """
        Learning rate scheduler used by callback
        Reduces learning rate depending on number of epochs
        """
        lr = learnrate
        if epoch > 150:
            lr /= 1000
        elif epoch > 120:
            lr /= 100
        elif epoch > 80:
            lr /= 10
        elif epoch > 2:
            lr /= 2
        # Test-only scaling: comment out the next line for real training
        lr = lr / 1000
        return lr
    

    # CIFAR10 dataset has 60k images. Training set is 50k, test set is 10k.
    # Each image is 32x32x8bits
    (x_train, y_train), (x_test, y_test) = datadownload()
    print ('Dataset downloaded and pre-processed')

    '''
    -----------------------------------------------
    CALLBACKS
    -----------------------------------------------
    '''

    # chkpt_call = ModelCheckpoint(filepath=keras_hdf5,
                                 # monitor='val_acc',
                                 # verbose=1,
                                 # save_best_only=True)
     
    chkpt_call = ModelCheckpoint(filepath=keras_hdf5+"epoch.{epoch:03d}.val_acc.{val_acc:.2f}.h5",
                                  monitor='val_acc',
                                  verbose=1,
                                  save_best_only=True)                  
 
    tb_call = TensorBoard(log_dir=tboard,
                          batch_size=batchsize,
                          update_freq='epoch')

    lr_scheduler_call = LearningRateScheduler(schedule=step_decay,
                                              verbose=1)

    lr_plateau_call = ReduceLROnPlateau(factor=np.sqrt(0.1),
                                        cooldown=0,
                                        patience=5,
                                        min_lr=0.5e-6)

    callbacks_list = [tb_call, lr_scheduler_call, lr_plateau_call, chkpt_call]

    model_path=keras_hdf5
    listfile = [i for i in os.listdir(model_path) if i.endswith("h5")] 
    print("listfile = {}".format(listfile))
    listfile.sort()
    listfile=listfile[::-1]
    print("listfile = {}".format(listfile))

#    if listfile is not None:
#    while not listfile:
    if len(listfile) != 0:
       model_path = model_path + listfile[0]  
       model = load_model(model_path)  
       # latest=tf.train.latest_checkpoint(model_path)
       #    json_string = model.to_json()
       #    print(json_string)
       # Finding the epoch index from which we are resuming
       # initial_epoch = get_init_epoch(checkpoint_path)
       initial_epoch=int(listfile[0].split(".")[1])
       print("initial_epoch = %d"%int(initial_epoch))
       # Calculating the correct value of count
       # count = initial_epoch*batchsize
       # Update the value of count in callback instance
       # callbacks_list[1].count = count 
    else:
       model = densenetx(input_shape=(input_height,input_width,input_chan),classes=10,theta=0.5,drop_rate=0.2,k=12,convlayers=[16,16,16])
       initial_epoch = 0  
        
    # prints a layer-by-layer summary of the network
    print('\n'+DIVIDER)
    print(' Model Summary')
    print(DIVIDER)
#    print(model.summary())
    print("Model Inputs: {ips}".format(ips=(model.inputs)))
    print("Model Outputs: {ops}".format(ops=(model.outputs)))

    model.summary()

    '''
    -----------------------------------------------
    TRAINING
    -----------------------------------------------
    '''

    '''
    Input image pipeline for training, validation
    
     data augmentation for training
       - random rotation
       - random horiz flip
       - random linear shift up and down
    '''
    data_augment = ImageDataGenerator(rotation_range=10,
                                      horizontal_flip=True,
                                      height_shift_range=0.1,
                                      width_shift_range=0.1,
                                      shear_range=0.1,
                                      zoom_range=0.1)

    train_generator = data_augment.flow(x=x_train,
                                        y=y_train,
                                        batch_size=batchsize,
                                        shuffle=True)
                                  
    '''
    Optimizer
    RMSprop used in this example.
    SGD  with Nesterov momentum was used in original paper
    '''
    #opt = SGD(lr=learnrate, momentum=0.9, nesterov=True)
    opt = RMSprop(lr=learnrate)
    
    model.compile(optimizer=opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # calculate number of steps in one epoch
    # train_steps = train_generator.n//train_generator.batch_size
    # shortened test run (1/100 of an epoch); integer division keeps steps_per_epoch an int
    train_steps = max(1, train_generator.n//train_generator.batch_size//100)
    print(f"train_generator.n = {train_generator.n}")
    print("train_steps = %d"%train_steps)
    # run training
    model.fit_generator(generator=train_generator,
                        epochs=epochs,
                        steps_per_epoch=train_steps,
                        validation_data=(x_test, y_test),
                        callbacks=callbacks_list,
                        verbose=1, initial_epoch=initial_epoch)

    print("\nTensorBoard can be opened with the command: tensorboard --logdir={dir} --host localhost --port 6006".format(dir=tboard))

    print('\n'+DIVIDER)
    # print(' Evaluate model accuracy with validation set..')
    # print(DIVIDER)

    # '''
    # -----------------------------------------------
    # EVALUATION
    # -----------------------------------------------
    # '''

    # scores = model.evaluate(x=x_test,y=y_test,batch_size=50, verbose=0)
    # print ('Evaluation Loss    : ', scores[0])
    # print ('Evaluation Accuracy: ', scores[1])


    # '''
    # -----------------------------------------------
    # PREDICTIONS
    # -----------------------------------------------
    # '''

    # # make predictions
    # predictions = model.predict(x_test,
                                # batch_size=batchsize,
                                # verbose=1)

    # # check accuracy
    # correct = 0
    # wrong = 0
    # for i in range(len(predictions)):
        # pred = np.argmax(predictions[i])
        # if (pred== np.argmax(y_test[i])):
            # correct+=1
        # else:
            # wrong+=1

    # print ('Correct predictions:',correct,' Wrong predictions:',wrong,' Accuracy:',(correct/len(predictions)))

    return


def run_main():
    
    print('\n'+DIVIDER)
    print('Keras version      : ',tf.keras.__version__)
    print('TensorFlow version : ',tf.__version__)
    print(sys.version)
    print(DIVIDER)

    # construct the argument parser and parse the arguments
    ap = argparse.ArgumentParser()
    ap.add_argument('-ih', '--input_height',
                    type=int,
                    default='32',
    	            help='Input image height in pixels.')
    ap.add_argument('-iw', '--input_width',
                    type=int,
                    default='32',
    	            help='Input image width in pixels.')
    ap.add_argument('-ic', '--input_chan',
                    type=int,
                    default='3',
    	            help='Number of input image channels.')
    ap.add_argument('-b', '--batchsize',
                    type=int,
                    default=100,
    	            help='Training batchsize. Must be an integer. Default is 100.')
    ap.add_argument('-e', '--epochs',
                    type=int,
                    default=300,
    	            help='number of training epochs. Must be an integer. Default is 300.')
    ap.add_argument('-lr', '--learnrate',
                    type=float,
                    default=0.001,
    	            help='optimizer initial learning rate. Must be floating-point value. Default is 0.001')
    ap.add_argument('-kh', '--keras_hdf5',
                    type=str,
                    default='./model.hdf5',
    	            help='path of Keras HDF5 file - must include file name. Default is ./model.hdf5')
    ap.add_argument('-tb', '--tboard',
                    type=str,
                    default='./tb_logs',
    	            help='path to folder for saving TensorBoard data. Default is ./tb_logs.')    
    args = ap.parse_args()
 
    args.learnrate = 0.002
    # final epochs
    args.epochs = 5

    print(' Command line options:')
    print ('--input_height : ',args.input_height)
    print ('--input_width  : ',args.input_width)
    print ('--input_chan   : ',args.input_chan)
    print ('--batchsize    : ',args.batchsize)
    print ('--learnrate    : ',args.learnrate)
    print ('--epochs       : ',args.epochs)
    print ('--keras_hdf5   : ',args.keras_hdf5)
    print ('--tboard       : ',args.tboard)
    print(DIVIDER)

    train(args.input_height,args.input_width,args.input_chan,args.batchsize,args.learnrate,args.epochs,args.keras_hdf5,args.tboard)


if __name__ == '__main__':
    run_main()
