Created June 24, 2018 © GPL3+

Robotic Assistant with Search and Rescue Capability

This design will perform the functions of several types of service dog. AI and sensors will provide emergency search and rescue skills.



Things used in this project

Hardware components

Raspberry Pi 3 Model B

Exceed RC car

servo PWM driver board

Story

I have been steadily improving this robot platform for a few years.

The most recent work added a Walabot radar sensor for a robotic assistant project.

https://www.hackster.io/user462411/walabot-security-robot-with-alexa-command-and-control-c979e6

Since I am always looking for new adventures, I found the Autonomous Robot Challenge presented an opportunity to push my robot platform to greater usefulness. I am excited to add AI features to help perform an autonomous search and rescue behavior. The AI will be used to classify data from a camera, a microphone, the Walabot radar sensor and an infrared temperature sensor.

Introduction to Neural Networks

Most of what I know about neural networks comes from this free online book: Neural Networks and Deep Learning by Michael Nielsen. Without the book I would have been almost completely lost trying to figure out how the donkeycar software works and how to solve the other problems involved in this project. The brevity and completeness of the book remind me of "The C Programming Language" by Kernighan and Ritchie. Reading a few chapters of the book is a good introduction to what makes the donkeycar software work.

I have been moved

I moved my place of residence during this project. This was quite a disturbance to workflow, thought processes and overall progress. The new place is a better place to do project work and for that I am thankful.

Donkeycar as a first working example.

The donkeycar project serves as a great example for future work with neural networks. There is simply not enough project time to start work from a completely blank project. It is hoped that the process of adding features to the donkeycar project will teach neural network theory and practice quickly.

Adding More Neural Network power to the Project

I will show what the Ultra96 board can do to add more neural network processing power to the design. I will benchmark key pieces of neural network code on the quad core ARM A53 processor alone, then compare the execution time with that of the same code accelerated by the FPGA fabric working in conjunction with the same processor.

Testing Donkeycar on the bench

Testing the steering and throttle

Building Donkeycar

Driving Webpage

Modifying Donkeycar software

The Donkeycar software as downloaded is a stateless design. The software does not develop a sense of location on the racetrack, at least not in a variable that we can use directly. This is a bit of a challenge since I want to deliver a package on a certain section of the racetrack. What should I do about this?

Options as I see them:

Install a process that watches a GPS and drops the package at the right spot on the racetrack. Probably the easiest option.
Create an additional neural network that signals when to drop the package, running in parallel with the throttle and steering angle neural networks. Probably the second easiest option. The drop signal may need filtering, since it could be noisy in a stateless design, unless a special sign is used as a drop cue.
Heavily modify the Donkey Car software to develop a sense of where donkeycar is along the racetrack. This is the most difficult option but the one with the greatest advantage with respect to creating an autonomous robot.
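
The filtering mentioned in the second option could be as simple as requiring the drop signal to stay high for several camera frames in a row. The sketch below is only an illustration of that idea and is not part of the Donkeycar software; the class name DropCueFilter, the threshold and the frame count are all hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper (not part of Donkeycar): turns a noisy per-frame
// "drop" score into a single trigger by requiring several consecutive
// frames at or above a threshold.
class DropCueFilter {
public:
    DropCueFilter(float threshold, std::size_t frames_required)
        : threshold_(threshold), frames_required_(frames_required) {}

    // Feed one network output per camera frame.
    // Returns true exactly once, on the frame that completes the run.
    bool update(float drop_score) {
        if (drop_score < threshold_) {
            streak_ = 0;
            fired_ = false;  // re-arm once the cue disappears
            return false;
        }
        if (streak_ < frames_required_) ++streak_;
        if (streak_ >= frames_required_ && !fired_) {
            fired_ = true;
            return true;
        }
        return false;
    }

private:
    float threshold_;
    std::size_t frames_required_;
    std::size_t streak_ = 0;
    bool fired_ = false;
};
```

With a filter built as DropCueFilter(0.8f, 3), a single high frame followed by a low one is ignored, and the package is released only on the third consecutive high frame. The right threshold and frame count would have to be found by experiment on the track.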

Ultra96 board "hello world"

First output from the Ultra96 board

The first step with the Ultra96 board is to get started with the design tools. Lucky for us, there is a good introductory project. Thanks Adam Taylor!

Introduction to the Ultra96 board and design tools: https://www.hackster.io/adam-taylor/accelerating-your-ultra96-developments-806a72

This project code will be modified to benchmark key pieces of neural network code. An example of such code is the activation function.

logistic function

Logistic function graph

This function and others like it are useful for computing neural networks. The derivative of the function is needed for training the neural network. Several techniques will be demonstrated to speed up the calculation. Most simple implementations of the exponential function need several floating point operations. Deep learning makes us want this to be very fast!
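
For reference, here is a minimal sketch of the logistic function and its derivative in plain C++. The function names are my own for illustration. The derivative uses the standard identity sigma'(x) = sigma(x) * (1 - sigma(x)), so once the output is known the derivative costs only one subtraction and one multiplication.

```cpp
#include <cmath>

// Logistic (sigmoid) activation: sigma(x) = 1 / (1 + e^-x).
inline float logistic(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

// Derivative expressed through the already computed output y = sigma(x):
// sigma'(x) = y * (1 - y). No second call to exp() is needed.
inline float logistic_derivative_from_output(float y) {
    return y * (1.0f - y);
}
```

Calling logistic(0.0f) gives 0.5, and the derivative at that point is 0.25, the steepest slope of the curve.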

Logistic Function Test1

In this first test, the logistic function is added to the matrix multiply project used earlier as the "hello world" example, at the point where it would go in neural network calculations. The matrix multiply example project is copied and the logistic function is added to both the accelerated and the non-accelerated code.

First the mmult.cpp file is changed on line 78 to add the logistic function.

void mmult_accel(float A[N*N], float B[N*N], float C[N*N]) 
{
    float _A[N][N], _B[N][N];
#pragma HLS array_partition variable=_A block factor=8 dim=2
#pragma HLS array_partition variable=_B block factor=8 dim=1
    for(int i=0; i<N; i++) {
         for(int j=0; j<N; j++) {
#pragma HLS PIPELINE
              _A[i][j] = A[i * N + j];
              _B[i][j] = B[i * N + j];
         }
    }
    for (int i = 0; i < N; i++) {
         for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE
              float result = 0;
              for (int k = 0; k < N; k++) {
                   float term = _A[i][k] * _B[k][j];
                   result += term;
              }
              C[i * N + j] = 1.0/(1.0 + expf(-result)); // accelerated logistic function
         }
    }
}

Second main.cpp is changed on line 88 to add the logistic function.

void mmult_golden(float *A,  float *B, float *C)
{
    for (int row = 0; row < N; row++) {
         for (int col = 0; col < N; col++) {
              float result = 0.0;
              for (int k = 0; k < N; k++) {
                   result += A[row*N+k] * B[k*N+col];
              }
              C[row*N+col] = 1.0/(1.0 + expf(-result)); // non-accelerated logistic function
         }
    }
}

Logistic Function Test1

The test results:

104,658 processor clock cycles were added by computing the logistic function in the non-accelerated code.

310 processor clock cycles were added by computing the logistic function in the accelerated portion of the code.

The logistic function portion of the code has been sped up about 337 times (104,658 / 310 ≈ 337.6)!

Any further speedup would have to come from the matrix multiply itself. There is not much left to shrink in the accelerated logistic function!

DropBox link for the zipped up archive of the SDSoC project

Making a long story short

In the future I will use a lookup table to speed up the activation function. The index into the table will be an integer, so I will also experiment with integer arithmetic for most of my neural network code. Early in development I will use a mix of floating point and fixed point arithmetic, but I suspect benchmarking and other practical experience will lead me to prefer mostly fixed point systems. If this sounds like intuition or guesswork to you then I call you very wise indeed.
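
To make the lookup table idea concrete, here is a minimal sketch in plain C++. It is not the code I will end up benchmarking. The table size, the input range and all of the names are assumptions chosen only for illustration. The table is filled once using floating point, and after that each activation is a clamp and an array read indexed by an integer.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

constexpr int kTableSize = 256;      // assumed table resolution
constexpr float kInputRange = 8.0f;  // assumed: table covers [-8, +8)

// Build the table once with the floating point logistic function.
inline std::array<float, kTableSize> make_logistic_table() {
    std::array<float, kTableSize> table{};
    for (int i = 0; i < kTableSize; ++i) {
        float x = (i - kTableSize / 2) * (2.0f * kInputRange / kTableSize);
        table[i] = 1.0f / (1.0f + std::exp(-x));
    }
    return table;
}

// Integer index in: x_fixed is the input in steps of
// 2 * kInputRange / kTableSize (1/16 with the values above).
// Out-of-range inputs clamp to the ends of the table.
inline float logistic_lut(const std::array<float, kTableSize>& table,
                          int32_t x_fixed) {
    int32_t index = x_fixed + kTableSize / 2;
    if (index < 0) index = 0;
    if (index >= kTableSize) index = kTableSize - 1;
    return table[index];
}
```

With these assumed values, an x_fixed of 16 stands for an input of 1.0. The table trades accuracy for speed: the output is quantized to the nearest table step, so the table size would need to be tuned against how much error the network can tolerate.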

In the scientific literature on neural networks there is much discussion of the merits of fixed point versus floating point neural networks. Work on binary neural networks further reduces the values used in neural networks to -1 and +1 (implemented as the binary circuit 1s and 0s with which we may be more familiar). I will not try to settle the argument or take sides in these discussions.
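
As a small illustration of why the -1 and +1 representation maps so well onto binary circuits, here is a sketch of a dot product over such values using XNOR and a bit count, a technique commonly described in the binary neural network literature. The function name and the bit packing (bit 1 for +1, bit 0 for -1, up to 32 elements per word) are my own assumptions for the example.

```cpp
#include <bitset>
#include <cstdint>

// Dot product of two vectors whose elements are -1 or +1, packed one
// element per bit (bit = 1 encodes +1, bit = 0 encodes -1).
// Only the low n bits are used; n must be between 0 and 32.
inline int binary_dot(uint32_t a_bits, uint32_t b_bits, int n) {
    uint32_t mask = (n >= 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    uint32_t agree = ~(a_bits ^ b_bits) & mask;  // XNOR: 1 where signs match
    int matches = static_cast<int>(std::bitset<32>(agree).count());
    return 2 * matches - n;  // each match adds +1, each mismatch adds -1
}
```

No multiplier is involved at all: the whole multiply-accumulate collapses into one XNOR and one bit count per word, which is the kind of logic FPGA fabric handles very cheaply.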

Going Further

Things you will want to know about developing with floating point

Things you will want to know about developing with fixed point

Code

mmult.cpp with logistic function added Test1

#include <stdio.h>
#include <stdlib.h>

#include "mmult.h"
// HLS Math Functions
//#include "hls_math.h"// HLS Math Functions
#include <math.h>
/**
 *
 * Design principles to achieve II = 1
 * 1. Stream data into local RAM for inputs (multiple access required)
 * 2. Partition local RAMs into N/2 sub-arrays for fully parallel access (dual-port read)
 * 3. Pipeline the dot-product loop, to fully unroll it
 * 4. Separate multiply-accumulate in inner loop to force two FP operators
 *
 */
void mmult_accel(float A[N*N], float B[N*N], float C[N*N]) 
{
     float _A[N][N], _B[N][N];
#pragma HLS array_partition variable=_A block factor=8 dim=2
#pragma HLS array_partition variable=_B block factor=8 dim=1
     
     for(int i=0; i<N; i++) {
          for(int j=0; j<N; j++) {
#pragma HLS PIPELINE
               _A[i][j] = A[i * N + j];
               _B[i][j] = B[i * N + j];
          }
     }
     
     for (int i = 0; i < N; i++) {
          for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE
               float result = 0;
               for (int k = 0; k < N; k++) {
                    float term = _A[i][k] * _B[k][j];
                    result += term;
               }
               C[i * N + j] = 1.0/(1.0 + /*hls::*/expf(-result));
          }
     }
}

main.cpp for logistic function Test 1

#include <iostream>
#include <stdlib.h>
#include <stdint.h>

#include "sds_lib.h"
#include "mmult.h"



// Non-HLS Math for benchmark purposes (non-accelerated versions of functions)
#include <math.h>

// Fixed point support experiments soon
//#include <ap_fixed.h>

#define NUM_TESTS 1024

class perf_counter
{
public:
     uint64_t tot, cnt, calls;
     perf_counter() : tot(0), cnt(0), calls(0) {};
     inline void reset() { tot = cnt = calls = 0; }
     inline void start() { cnt = sds_clock_counter(); calls++; };
     inline void stop() { tot += (sds_clock_counter() - cnt); };
     inline uint64_t avg_cpu_cycles() { return (tot / calls); };
};

static void init_arrays(float *A,  float *B, float *C_sw, float *C)
{
     for (int i = 0; i < N; i++) {
          for (int j = 0; j < N; j++) {
               A[i * N + j] = 1+i*N+j;
               B[i * N + j] = rand() % (N * N);
               C_sw[i * N + j] = 0.0;
               C[i * N + j] = 0.0;
          }
     }
}

void mmult_golden(float *A,  float *B, float *C)
{
     for (int row = 0; row < N; row++) {
          for (int col = 0; col < N; col++) {
               float result = 0.0;
               for (int k = 0; k < N; k++) {
                    result += A[row*N+k] * B[k*N+col];
               }
               C[row*N+col] = 1.0/(1.0 + expf(-result)); // non-accelerated logistic function
          }
     }
}

static int result_check(float *C, float *C_sw)
{
     for (int i = 0; i < N * N; i++) {
          if (C_sw[i] != C[i]) {
               std::cout << "Mismatch: data index=" << i << ", d=" << C_sw[i] 
                         << ", dout=" << C[i] << std::endl;
               return 1;
          }
     }
     return 0;
}

int mmult_test(float *A,  float *B, float *C_sw, float *C)
{
     std::cout << "Testing " << NUM_TESTS << " iterations of " << N << "x" << N 
               << " floating point mmult..." << std::endl;

     perf_counter hw_ctr, sw_ctr;
     
     for (int i = 0; i < NUM_TESTS; i++) 
     {
          init_arrays(A, B, C_sw, C);

          sw_ctr.start();
          mmult_golden(A, B, C_sw);
          sw_ctr.stop();

          hw_ctr.start();
          mmult_accel(A, B, C);
          hw_ctr.stop();

          if (result_check(C, C_sw))
               return 1;
     }
     uint64_t sw_cycles = sw_ctr.avg_cpu_cycles();
     uint64_t hw_cycles = hw_ctr.avg_cpu_cycles();
     double speedup = (double) sw_cycles / (double) hw_cycles;

     std::cout << "Average number of CPU cycles running mmult in software: "
               << sw_cycles << std::endl;
     std::cout << "Average number of CPU cycles running mmult in hardware: "
               << hw_cycles << std::endl;
     std::cout << "Speed up: " << speedup << std::endl;

     return 0;
}

/**
 * Design principles to achieve performance
 *
 * 1. sds_alloc to guarantee physically contiguous buffer allocation
 *    that enables the most efficient DMA configuration (axidma_simple)
 */
int main(int argc, char* argv[]){
     int test_passed = 0;
     float *A, *B, *C_sw, *C;

     A = (float *)sds_alloc(N * N * sizeof(float));
     B = (float *)sds_alloc(N * N * sizeof(float));
     C = (float *)sds_alloc(N * N * sizeof(float));
     C_sw = (float *)sds_alloc(N * N * sizeof(float));
     
     if (!A || !B || !C || !C_sw) {
          if (A) sds_free(A);
          if (B) sds_free(B);
          if (C) sds_free(C);
          if (C_sw) sds_free(C_sw);
          return 2;
     }
     
     test_passed = mmult_test(A, B, C_sw, C);
     
     std::cout << "TEST " << (test_passed ? "FAILED" : "PASSED") << std::endl;

     sds_free(A);
     sds_free(B);
     sds_free(C);
     sds_free(C_sw);
     
     return (test_passed ? -1 : 0);
}

Credits

Tom Minnich


Embedded software guy for a long time
