Smart Sensing on the NXP Rapid IoT Kit (ARM NN at the Edge)

Make your sensor edge node smarter by integrating a machine-learning Gated Recurrent Unit (GRU) for time-series anomaly detection.

Things used in this project

Hardware components

NXP Rapid IoT Prototyping Kit

NXP Hexiwear

NXP FRDM Board

Story

The goal of the project is to run a recurrent neural network (RNN) model on a Cortex-M4-class CPU and to integrate it into an environmental sensing node, where it analyzes the time series from one of the node's sensors. The NXP K64 MCU in the Rapid IoT kit is a good reference platform for trying out the concept. An RNN can predict future values of the time series, which in turn makes it possible to issue early warnings for disaster events.

We started with the web-based IDE for the kit but switched to the desktop IDE so that we could run the neural network on the data as it came in from the sensors. We built the Gated Recurrent Unit (GRU) on the ARM CMSIS NN library, which is optimized for the Cortex-M series, and developed the GRU code on the FRDM K64 board.

We developed the GRU model in an Azure Jupyter notebook using Keras, where it was easy to change the model parameters and train the model. The next challenge was to turn the weights and bias values of the trained model into a set of #define statements that could be compiled into the ARM CMSIS NN GRU implementation for the Cortex-M4. The Keras tensors had to be reordered and restructured to meet the requirements of CMSIS NN. We first tested this with a set of Excel macros and then coded it as a C program that reads the .h5 file and writes the #define statements into a .h include file, which is compiled for the FRDM board. We checked the include file by writing a small driver that reproduced, on the FRDM board, the same input data the GRU model had been run with.
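The header-generation step can be illustrated with a minimal sketch. It assumes the weights have already been read from the .h5 file, reordered, and quantized to 16-bit values; `format_define` is a hypothetical helper name, not the converter used in this project, and the macro name passed to it is only an example.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (an assumption, not the project's converter):
 * formats n 16-bit values as "#define NAME {v0,v1,...}\n" into out.
 * Returns the string length, or -1 if the buffer is too small. */
int format_define(char *out, size_t cap, const char *name,
                  const int16_t *vals, size_t n)
{
    int len = snprintf(out, cap, "#define %s {", name);
    if (len < 0 || (size_t)len >= cap) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        /* A comma goes before every value except the first. */
        int w = snprintf(out + len, cap - (size_t)len, "%s%d",
                         i ? "," : "", (int)vals[i]);
        if (w < 0 || (size_t)w >= cap - (size_t)len) {
            return -1;
        }
        len += w;
    }
    int w = snprintf(out + len, cap - (size_t)len, "}\n");
    if (w < 0 || (size_t)w >= cap - (size_t)len) {
        return -1;
    }
    return len + w;
}
```

Writing the returned string to a .h file with `fputs` gives a macro that can initialize a static array, in the same way the listing below uses macros such as UPDATE_GATE_BIAS.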

The goal for testing was to predict a time series that mimics the daily temperature variation: a single-hump curve that repeats every 24 hours. The Keras model captured that behavior very well with just two GRU cells in a single GRU layer coupled to a dense layer. ARM CMSIS NN stores fractional values in the fixed-point Q format, so we modified the code provided by ARM to convert our floating-point Keras weights into Q-format numbers for CMSIS.
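For readers new to the Q format: a q15 number stores a value in [-1, 1) as a 16-bit integer scaled by 2^15. The sketch below is a portable stand-in for that conversion, assuming round-to-nearest and saturation; the function names are hypothetical, and it is not the ARM code modified for this project.

```c
#include <stdint.h>

/* Convert a float in roughly [-1, 1) to q15 (value * 2^15),
 * rounding to nearest and saturating at the ends of the range. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) {
        return 32767;       /* +1.0 is not representable, so clamp */
    }
    if (scaled <= -32768.0f) {
        return -32768;
    }
    /* Add or subtract 0.5 so that truncation rounds to nearest. */
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Convert a q15 value back to float. */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}
```

Weights with a magnitude of 1 or more saturate in this scheme, so a trained model with larger weights needs them rescaled before conversion.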

Then we tested whether the FRDM board generated the "single hump" curve for every 24 data points, and IT DID! That was the most exciting moment of the entire project. As far as we know, we are the first group to implement a GRU on an ARM Cortex-M. The existing work we found only shows CNN implementations for image recognition (like the CIFAR cats and dogs examples). The sample GRU code provided by ARM populates the weights with random numbers and may also have some bugs. We hope to interact with the ARM CMSIS NN team to get some clarification.

The last piece of the puzzle was to modify the "example code for weather station" provided by NXP to include the GRU functionality. That was not too difficult once we figured out where in "sensors.c" the call to our GRU code had to be added. We decided to override the temperature reading with the "single-hump" data generated by the GRU and send that via Bluetooth to the Android Weather Station Demo app (and on to the cloud). That worked out just fine.
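Feeding the GRU from a sensor loop means keeping the most recent samples in a fixed-length window, newest first, as the main loop in the listing below does with its synthetic sine input. A minimal sketch of that window update follows; `window_push_q15` is a hypothetical helper name, and the call site inside "sensors.c" is not shown because it is not given here.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: shift every sample one slot toward the end of the
 * window (the oldest sample drops off) and store the newest at index 0. */
void window_push_q15(int16_t *window, size_t len, int16_t newest)
{
    if (len == 0) {
        return;
    }
    for (size_t j = len - 1; j > 0; j--) {
        window[j] = window[j - 1];
    }
    window[0] = newest;
}
```

After each push, the window would be copied into the input slot of the GRU scratch buffer before the next GRU call.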

Our cover image shows the "cloud output" and the rest of the hardware.

It turned out that our idea of using a GRU for time-series analysis had been explored by a Korean team (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5712838/) in 2017, but there the GRU was executed in the cloud. We believe that we are the first research team to explore a GRU on an edge device.

This work supplements and complements the work done by the NXP team, where two models can be trained: SVM (Support Vector Machine) or GMM (Gaussian Mixture Model). They used the Rapid IoT Kit for anomaly detection on an electric drill and note that: "A few of Rapid IoT's sensors are utilized- gyro/accel/mag, Rapid IoT is trained to send an alert when an out of bound condition occurs, and the same principle of anomaly detection can be applicable for other verticals- e.g. temperature, humidity, air quality for Home and Building automation."

The same comments apply to the GRU tool developed by our team, except that model training takes place externally. On the other hand, it is extremely easy to incorporate our code into the existing Rapid IoT Sensor Examples. There is also a great deal of existing research on using GRUs for speech recognition, and it would be interesting to add a microphone to the Rapid IoT kit and test its ability to do real-time speech recognition.

Code

GRU implementation using ARM CMSIS-NN

/*
 * Copyright 2016-2018 NXP Semiconductor, Inc.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without modification,
 * are permitted provided that the following conditions are met:
 *
 * o Redistributions of source code must retain the above copyright notice, this list
 *   of conditions and the following disclaimer.
 *
 * o Redistributions in binary form must reproduce the above copyright notice, this
 *   list of conditions and the following disclaimer in the documentation and/or
 *   other materials provided with the distribution.
 *
 * o Neither the name of NXP Semiconductor, Inc. nor the names of its
 *   contributors may be used to endorse or promote products derived from this
 *   software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
 * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
 * ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */
 
/**
 * @file    MK64FN1M0xxx12_A.c
 * @brief   Application entry point.
 */
#include <stdio.h>
#include "board.h"
#include "peripherals.h"
#include "pin_mux.h"
#include "clock_config.h"
#include "MK64F12.h"
#include "fsl_debug_console.h"


/**
 * @defgroup GRUExample Gated Recurrent Unit Example
 *
 * \par Description:
 * \par
 * Demonstrates a gated recurrent unit (GRU) example with the use of fully-connected,
 * Tanh/Sigmoid activation functions.
 *
 * \par Model definition:
 * \par
 * GRU is a type of recurrent neural network (RNN). It contains two sigmoid gates and one hidden
 * state.
 * \par
 * The computation can be summarized as:
 * <pre>z[t] = sigmoid( W_z &sdot; {h[t-1],x[t]} )
 * r[t] = sigmoid( W_r &sdot; {h[t-1],x[t]} )
 * n[t] = tanh( W_n &sdot; {r[t] &times; h[t-1], x[t]} )
 * h[t] = z[t] &times; h[t-1] + (1 - z[t]) &times; n[t] </pre>
 * \image html GRU.gif "Gated Recurrent Unit Diagram"
 *
 * \par Variables Description:
 * \par
 * \li \c update_gate_weights, \c reset_gate_weights, \c hidden_state_weights are weights corresponding to update gate (W_z), reset gate (W_r), and hidden state (W_n).
 * \li \c update_gate_bias, \c reset_gate_bias, \c hidden_state_bias are layer bias arrays
 * \li \c test_input1, \c test_input2, \c test_history are the inputs and initial history
 *
 * \par
 * The buffer is allocated as:
 * \par
 * | reset | input | history | update | hidden_state |
 * \par
 * In this way, the concatenation is automatically done since (reset, input) and (input, history)
 * are physically concatenated in memory.
 * \par
 *  The ordering of the weight matrix should be adjusted accordingly.
 *
 * \par CMSIS DSP Software Library Functions Used:
 * \par
 * - arm_fully_connected_q15_opt()
 * - arm_nn_activations_direct_q15()
 * - arm_mult_q15()
 * - arm_offset_q15()
 * - arm_sub_q15()
 * - arm_dot_prod_q15()
 * - arm_copy_q15()
 * - arm_q15_to_float(), arm_q31_to_float(), arm_float_to_q15()
 *
 * <b> Refer  </b>
 * \link arm_nnexamples_gru.cpp \endlink
 *
 */
/* TODO: insert other include files here. */
#include <math.h>
#include <MK64FN1M0xxx12_A.h>
//#include <arm_nnexamples_gru_test_data.h>
#include "arm_math.h"
#include "arm_nnfunctions.h"
/* TODO: insert other definitions and declarations here. */
#ifdef _RTE_
#include "RTE_Components.h"
#ifdef RTE_Compiler_EventRecorder
#include "EventRecorder.h"
#endif
#endif

//#define DIM_HISTORY 32
#define DIM_HISTORY 8
//#define DIM_INPUT 32
#define DIM_INPUT 8
//#define DIM_VEC 64
#define DIM_VEC 16
//#define DIM_DENSE 32
#define DIM_DENSE 8


#define USE_X4

#ifndef USE_X4
static q7_t update_gate_weights[DIM_VEC * DIM_HISTORY] = UPDATE_GATE_WEIGHT_X2;
static q7_t reset_gate_weights[DIM_VEC * DIM_HISTORY] = RESET_GATE_WEIGHT_X2;
static q7_t hidden_state_weights[DIM_VEC * DIM_HISTORY] = HIDDEN_STATE_WEIGHT_X2;
#else
/*
static q7_t update_gate_weights[DIM_VEC * DIM_HISTORY] = UPDATE_GATE_WEIGHT_X4;
static q7_t reset_gate_weights[DIM_VEC * DIM_HISTORY] = RESET_GATE_WEIGHT_X4;
static q7_t hidden_state_weights[DIM_VEC * DIM_HISTORY] = HIDDEN_STATE_WEIGHT_X4;
*/
static q15_t update_gate_weights[DIM_VEC * DIM_HISTORY] = UPDATE_GATE_WEIGHT_X4;
static q15_t reset_gate_weights[DIM_VEC * DIM_HISTORY] = RESET_GATE_WEIGHT_X4;
static q15_t hidden_state_weights[DIM_VEC * DIM_HISTORY] = HIDDEN_STATE_WEIGHT_X4;
static q15_t dense_layer_weights[DIM_DENSE] = DENSE_LAYER_WEIGHT_X4;
#endif
static float32_t update_gate_weights_f[DIM_VEC * DIM_HISTORY];
static float32_t reset_gate_weights_f[DIM_VEC * DIM_HISTORY];
static float32_t hidden_state_weights_f[DIM_VEC * DIM_HISTORY];
static float32_t dense_layer_weights_f[DIM_DENSE];
/*
static q7_t update_gate_bias[DIM_HISTORY] = UPDATE_GATE_BIAS;
static q7_t reset_gate_bias[DIM_HISTORY] = RESET_GATE_BIAS;
static q7_t hidden_state_bias[DIM_HISTORY] = HIDDEN_STATE_BIAS;
*/
static q15_t update_gate_bias[DIM_HISTORY] = UPDATE_GATE_BIAS;
static q15_t reset_gate_bias[DIM_HISTORY] = RESET_GATE_BIAS;
static q15_t hidden_state_bias[DIM_HISTORY] = HIDDEN_STATE_BIAS;

static q15_t test_input1[DIM_INPUT] = INPUT_DATA1;
//static q15_t test_input2[DIM_INPUT] = INPUT_DATA2;
static q15_t test_history[DIM_HISTORY] = HISTORY_DATA;

q15_t     scratch_buffer[DIM_HISTORY * 4 + DIM_INPUT];
float32_t scratch_buffer_f[DIM_HISTORY * 4 + DIM_INPUT];

static q15_t unity[DIM_DENSE]= UNITY_X4;

/*
void gru_example(q15_t * scratch_input, uint16_t input_size, uint16_t history_size,
                 q7_t * weights_update, q7_t * weights_reset, q7_t * weights_hidden_state,
                 q7_t * bias_update, q7_t * bias_reset, q7_t * bias_hidden_state)
*/
q63_t gru_example(q15_t * scratch_input, uint16_t input_size, uint16_t history_size,
                 q15_t * weights_update, q15_t * weights_reset, q15_t * weights_hidden_state,
                 q15_t * bias_update, q15_t * bias_reset, q15_t * bias_hidden_state)
{
  q15_t    *reset = scratch_input;
  q15_t    *input = scratch_input + history_size;
  q15_t    *history = scratch_input + history_size + input_size;
  q15_t    *update = scratch_input + 2 * history_size + input_size;
  q15_t    *hidden_state = scratch_input + 3 * history_size + input_size;
  q63_t    out63;
  q31_t    out31;
  // reset gate calculation
  // the range of the output can be adjusted with bias_shift and output_shift
#ifndef USE_X4
  arm_fully_connected_mat_q7_vec_q15(input, weights_reset, input_size + history_size, history_size, 0, 15, bias_reset,
                                     reset, NULL);
#else
/*
  arm_fully_connected_mat_q7_vec_q15_opt(input, weights_reset, input_size + history_size, history_size, 0, 15,
                                         bias_reset, reset, NULL);
*/
  arm_fully_connected_q15_opt(input, weights_reset, input_size + history_size, history_size, 0, 15,
                                         bias_reset, reset, NULL);
#endif
  // sigmoid function, the size of the integer bit-width should be consistent with out_shift
  arm_nn_activations_direct_q15(reset, history_size, 0, ARM_SIGMOID);
  arm_mult_q15(history, reset, reset, history_size);

  // update gate calculation
  // the range of the output can be adjusted with bias_shift and output_shift
#ifndef USE_X4
  arm_fully_connected_mat_q7_vec_q15(input, weights_update, input_size + history_size, history_size, 0, 15,
                                     bias_update, update, NULL);
#else
/*
  arm_fully_connected_mat_q7_vec_q15_opt(input, weights_update, input_size + history_size, history_size, 0, 15,
                                         bias_update, update, NULL);
*/
  arm_fully_connected_q15_opt(input, weights_update, input_size + history_size, history_size, 0, 15,
                                         bias_update, update, NULL);
#endif

  // sigmoid function, the size of the integer bit-width should be consistent with out_shift
  arm_nn_activations_direct_q15(update, history_size, 0, ARM_SIGMOID);

  // hidden state calculation
#ifndef USE_X4
  arm_fully_connected_mat_q7_vec_q15(reset, weights_hidden_state, input_size + history_size, history_size, 0, 15,
                                     bias_hidden_state, hidden_state, NULL);
#else
/*
  arm_fully_connected_mat_q7_vec_q15_opt(reset, weights_hidden_state, input_size + history_size, history_size, 0, 15,
                                         bias_hidden_state, hidden_state, NULL);
*/
  arm_fully_connected_q15_opt(reset, weights_hidden_state, input_size + history_size, history_size, 0, 15,
                                         bias_hidden_state, hidden_state, NULL);
  #endif

  // tanh function, the size of the integer bit-width should be consistent with out_shift
  arm_nn_activations_direct_q15(hidden_state, history_size, 0, ARM_TANH);
//  arm_mult_q15(update, hidden_state, hidden_state, history_size);

  // we calculate z - 1 here
  // so final addition becomes subtraction
/*
  arm_offset_q15(update, 0x8000, update, history_size);
  // multiply history
  arm_mult_q15(history, update, update, history_size);
  // calculate history_out
  arm_sub_q15(hidden_state, update, history, history_size);
*/
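  // z * h[t-1], stored in the reset slot, which is free at this point
  arm_mult_q15(history, update, reset, history_size);
  // update = z - 1 (0x8000 is -1.0 in q15)
  arm_offset_q15(update, 0x8000, update, history_size);
  // (z - 1) * n
  arm_mult_q15(update, hidden_state, hidden_state, history_size);
  // h[t] = z * h[t-1] - (z - 1) * n = z * h[t-1] + (1 - z) * n
  arm_sub_q15(reset, hidden_state, history, history_size);
  // dense layer: weight each hidden unit, then sum through a dot product with unity
  arm_mult_q15(history, dense_layer_weights, reset, history_size);
  out63 = 0;
  arm_dot_prod_q15(reset, unity, (uint32_t) history_size, &out63);
  out31 = clip_q63_to_q31(out63);
  // float copy of the scratch buffer; it does not affect the returned value
  arm_q15_to_float(scratch_buffer, scratch_buffer_f, DIM_HISTORY * 4 + DIM_INPUT);
  out31 += DENSE_BIAS;
  return out31;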
}

int j;
float32_t v[1];

/*
 * @brief   Application entry point.
 */

int main(void) {

  	/* Init board hardware. */
    BOARD_InitBootPins();
    BOARD_InitBootClocks();
    BOARD_InitBootPeripherals();
  	/* Init FSL debug console. */
    BOARD_InitDebugConsole();

    PRINTF("Hello World\n");

	#ifdef RTE_Compiler_EventRecorder
	EventRecorderInitialize (EventRecordAll, 1);  // initialize and start Event Recorder
	#endif

	printf("Start GRU execution\n");
	int       input_size = DIM_INPUT;
	int       history_size = DIM_HISTORY;
	q31_t gruout31[1];
	float32_t fgruout[1];
	float32_t pred;
	// copy over the input data
	arm_copy_q15(test_input1, scratch_buffer + history_size, input_size);
	arm_copy_q15(test_history, scratch_buffer + history_size + input_size, history_size);
/*
	foo= gru_example(scratch_buffer, input_size, history_size,
				update_gate_weights, reset_gate_weights, hidden_state_weights,
				update_gate_bias, reset_gate_bias, hidden_state_bias);
	printf("Complete first iteration on GRU\n");

	arm_copy_q15(test_input2, scratch_buffer + history_size, input_size);
	foo= gru_example(scratch_buffer, input_size, history_size,
				update_gate_weights, reset_gate_weights, hidden_state_weights,
				update_gate_bias, reset_gate_bias, hidden_state_bias);
	printf("Complete second iteration on GRU\n");
*/
    /* Loop counter; it also sets the phase of the synthetic sine input. */
    volatile static int i = 0 ;
    /* Float copies of the q15 weights; the GRU itself does not use them. */
    arm_q15_to_float(update_gate_weights,update_gate_weights_f,DIM_VEC * DIM_HISTORY);
    arm_q15_to_float(reset_gate_weights,reset_gate_weights_f,DIM_VEC * DIM_HISTORY);
    arm_q15_to_float(hidden_state_weights,hidden_state_weights_f,DIM_VEC * DIM_HISTORY);
    arm_q15_to_float(dense_layer_weights,dense_layer_weights_f,DIM_DENSE);
    while(1) {
    	gruout31[0]= gru_example(scratch_buffer, input_size, history_size,
    				update_gate_weights, reset_gate_weights, hidden_state_weights,
    				update_gate_bias, reset_gate_bias, hidden_state_bias);
    	arm_q31_to_float(gruout31,fgruout,1);
    	// print the prediction scaled by 1000, as an integer
    	printf("%d\n",(int) (fgruout[0]*1000.0));
//    	printf("Completed another iteration on GRU\n");
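    	// slide the input window by one sample; the oldest value drops off the end
    	for (j=DIM_INPUT-1;j>0;j--)
    	{
    		test_input1[j]=test_input1[j-1];
    	}
    	// next synthetic sample: a sine wave with a period of 8 steps
    	v[0]=sin(2*3.14159*i/8);
    	arm_float_to_q15(v,test_input1,1);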
    	arm_copy_q15(test_input1, scratch_buffer + history_size, input_size);
        i++ ;
    }
    return 0 ;
}
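
The final hidden-state update in gru_example is easier to follow for a single element. The sketch below is a portable, scalar illustration of the same sequence of steps (multiply, offset by -1.0, multiply, subtract), assuming q15 products are shifted right by 15 and saturated as the CMSIS DSP functions do. The helper names are hypothetical and the code is not part of the project.

```c
#include <stdint.h>

/* Saturate a 32-bit intermediate to the q15 range. */
static int16_t sat_q15(int32_t v)
{
    if (v > 32767) {
        return 32767;
    }
    if (v < -32768) {
        return -32768;
    }
    return (int16_t)v;
}

/* q15 multiply: (a * b) >> 15, saturated. Assumes an arithmetic right
 * shift for negative values, as GCC and Clang provide. */
static int16_t mul_q15(int16_t a, int16_t b)
{
    return sat_q15(((int32_t)a * (int32_t)b) >> 15);
}

/* One element of h[t] = z * h[t-1] + (1 - z) * n, computed as
 * z * h[t-1] - (z - 1) * n so that only a subtraction is needed. */
int16_t gru_blend_q15(int16_t z, int16_t h_prev, int16_t n)
{
    int16_t zh = mul_q15(z, h_prev);                 /* z * h[t-1] */
    int16_t z_minus_1 = sat_q15((int32_t)z - 32768); /* z - 1.0    */
    int16_t zn = mul_q15(z_minus_1, n);              /* (z - 1) * n */
    return sat_q15((int32_t)zh - (int32_t)zn);
}
```

With z = 0 the result is the candidate state n, and with z close to 1 it stays near the previous hidden state (slightly below it, because the q15 multiply truncates), which is the blending behavior expected from the update gate.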