The main objective of this project is to detect and track faces in a real-time video stream. Tracking is handled by an Arduino Uno microcontroller, which is connected to two servo motors and two drive motors. The servos are centered before tracking begins. The bounding-box coordinates computed on the ESP32 are used to track the face in subsequent frames. The servos pan and tilt the webcam mounted on them, so the camera's position follows the movement of the person.
Design and Building
The input video stream is obtained using an ESP32-CAM. The ESP-WHO framework takes QVGA (320×240) images as input. Face detection is implemented with MTCNN and MobileNet and returns the position of any faces present in the image. Each frame is examined for a face; detection works only on frontal faces. Once a face is spotted, a bounding box is drawn around it.
The coordinates of the box obtained after detecting a face in a frame are then processed on the ESP32 microcontroller.
An ESP32 that is also equipped with other sensors (ultrasonic, IR, etc.) does not have enough free pins to control two servos and two motors. In addition, the image processing puts a heavy load on the ESP32. In the end I decided on a combination of two processors: an Arduino platform driving the two motors and two servos, with the ESP32 as the video stream source, as the picture below shows.
Another reason for the additional processor was timing: face recognition alone takes at least 200 ms per frame, and any further delay would be undesirable. The second processor therefore takes full control of the servos as well as the vehicle movement.
The ESP32-CAM can be programmed using the Arduino IDE, which supports the ESP32 platform.
Those who have little or no experience with ESP32 camera development boards can start with the following tutorial: ESP32-CAM Video Streaming and Face Recognition with Arduino IDE.
My previous projects covered building a remote control for robot cars and a robot arm; a detailed description of how to develop that project is here. To control the movement of the robot car, it should be connected to a platform with a motor driver or another motor controller that can be driven by the ESP32-CAM or via Bluetooth. A detailed description can be found on the page here.
Code
Although tremendous strides have been made in face recognition, one remaining open challenge is achieving real-time speed on a CPU while maintaining high accuracy, since effective face recognition models tend to be computationally intensive. To meet this challenge, I propose a setup that performs well in terms of both speed and accuracy.
The code shown below is the minimum needed to detect faces in the camera images; the web server is not active because video streaming was not necessary.
When building the program in C++ for the ESP32 you can use many functions from the ESP-WHO library. The headers esp_camera.h, fd_forward.h and fb_gfx.h allow us to initialize and interact with the camera and to detect faces in the image. They help get the data into the right format and perform the operations you need (look at the header files for the names and descriptions of the available functions); once all that is done, you can run the inference.
If the program keeps rebooting because of brownout resets, the following includes can help:
#include "soc/soc.h" // disable brownout problems
#include "soc/rtc_cntl_reg.h" // disable brownout problems
How to find the center of the face
A function called draw_face_boxes is normally used to display a box around a detected face.
As a result of this function we get the coordinates of the top-left corner of the box. As can be seen in the picture, the center of the box differs from the center of the picture, so the camera should move up and to the right, as shown by the blue arrows. The X and Y coordinates of this corner, combined with the box's width and height, can be used to find the center of the box and therefore the center of the face.
Top-left corner of the box: x = 74 px, y = 82 px
Center of the face in the frame:
X = x + width/2 = 74 + 120/2 = 134 px
Y = y + height/2 = 82 + 140/2 = 152 px
To convert from pixels to degrees for QVGA (320 px × 240 px, diagonal 400 px), divide the diagonal by the camera's field of view. I am using an OV2640 camera module, which shipped with my ESP32 board and has a 60° FoV (with multiple lens options, FoVs of 100°/120°/140°/170° are available).
For my camera this gives the pixels per degree of rotation:
400 / 60 ≈ 6.7
The distance of the face center from the frame center, converted into degrees, is then given by the following formula:
posH =posH + (160 - face_center_pan)/7;
What remains is to send the servo angle via Serial2:
Serial2.printf("H%d \n", posH);
A code can be added so that the motors and servos only activate when the face is outside the frame.
Because the pan servo's range is limited to 10° and 170°, the horizontal movement must be constrained.
If the face is at the edge of the picture, the whole module can be turned using the drive motors. The code is below.
As noted, when the module turns left or right, the position of the pan servo is corrected by 20 degrees: instead of 10° we set it to 30°, which matches the corresponding turn.
No correction is needed for the servo that positions the camera vertically, so that code is correspondingly simpler.
We also define a function, initCamera(), to take care of the camera initialization, and call it in the Arduino setup(). Among the many initialization parameters, we set the frame size to QVGA, which is the recommended resolution for face detection.
In setup() we start by opening two serial connections, so we can output messages and commands when we detect faces in the captured image.
After that we call the mtmn_init_config function (we will use the default configuration); it takes no arguments and returns a set of default MTMN configurations we can use right away to start detecting faces in the camera images.
mtmn_config_t mtmn_config = {0};
We will write the rest of our code in the Arduino main loop. The first thing we need to do is to obtain a camera frame. So we start by declaring a variable that will hold a pointer to a struct of type camera_fb_t. This struct holds a pointer to the buffer containing the actual image, along with metadata such as the width and height of the image and the length of the buffer.
camera_fb_t * frame;
Then we call the esp_camera_fb_get function to get an image from the camera:
frame = esp_camera_fb_get();
This function takes no arguments and returns a pointer to a camera_fb_t struct, which we store in our previously declared variable.
Then we call the dl_matrix3du_alloc function (from the Deep Learning library) to allocate an image matrix; it returns a pointer to the allocated matrix struct.
In addition to working with this struct type, the detection function expects the image in RGB888 format. To convert the captured image we call the fmt2rgb888 function, which converts our original JPEG image to RGB888.
esp_camera_fb_return(frame);
This call allows the image buffer to be reused, which makes sense since we continuously grab new images and don't need to keep the old ones.
static void draw_face_boxes(dl_matrix3du_t *image_matrix, box_array_t *boxes)
Function called draw_face_boxes is used to display a box around a detected face.
box_array_t *boxes = face_detect(image_matrix, &mtmn_config);
A box_array_t value contains the face boxes, together with a score and landmarks for each box: the coordinates of the top-left and bottom-right corners, plus the landmark points.
if (boxes != NULL) {
We just want to know whether faces were detected in the image. We simply check whether this pointer is NULL; if it is, we increment the noDetection counter, which is of interest to the search function. If the pointer is not NULL, draw_face_boxes(image_matrix, boxes) sends the commands (as prints) to move, tilt or pan the camera.
Else
However, if no face is recognized for a certain time, a search function has to be activated. As a timer for the search we use the variable noDetection, which is incremented on each unsuccessful detection attempt. The action inside the else branch is split so that the search first sweeps the side where the face was last seen, then switches to the other side. Since noDetection serves as the timer, each step follows after about 200-240 ms. If a face is recognized in the meantime, the entire action inside the else branch is cancelled.
Part of "Else" can be seen in the picture below.
From the robot projects Multi-Functional 2WD driving Straight Robot Car and "Remote control and Video Monitoring with ESP32 for Robot" we reuse a table of commands.
Buttons to move the car in Left, Right, Forward and reverse directions
To control the camera we could change the commands: for example, instead of "2" for left and "4" for down, there would be a full command for the servo's horizontal or vertical position, each followed by a number indicating the angular position.
With such a command we can send the exact position to both servos (pan and tilt).
The main change is two more case rows in the switch(getstr) statement in the loop:
// V or H case string for Servo
case 'V': posV = Serial.parseInt();movToV();break; //now holds posV
case 'H': posH = Serial.parseInt();movToH();break; //now holds posH and mov()
default: break;
and of course there are two subroutines:
void movToV()
{
//++++++++++++++ Head vertical turn to posV
//stp();
if (posV<EndDown) posV=EndDown;
if (posV>EndUp) posV=EndUp;
myservoV.write(posV);
delay(t); // delay 40/5ms(used to adjust the servo speed)
}
void movToH()
{
// ++++++++++++ A Head Horizontal turn to posH
//stp();
if (posH<10) posH=10;
if (posH>170) posH=170;
myservoH.write(posH);
delay(t); // delay 20/5ms(used to adjust the servo speed)
}
The program targets the DRV8835 H-bridge. Since I had to replace the DRV8835 later, I made a few small changes for the new L298N module, after which the old program fits without any further changes. The schematics can be downloaded from the site as a Word document.
Testing the code
To test the code, compile it and upload it to your ESP32, making sure the board is correctly connected to the camera. Also connect pins 13 (RX) and 15 (TX) to the Arduino platform, and the normal serial port to an FT232RL FTDI USB adapter. Once the upload finishes, open the Arduino IDE serial monitor.
Then point the camera at your face. You should see something similar to the figure (right side) in the serial monitor: for each face detection a timestamp is sent, as well as the position in the frame, followed by commands for the H and V servos with the corresponding angles.
Look at the example in the picture: the ESP sends the horizontal (H91) and vertical (V89) servo commands, as well as the exact frame position (150) and the detection timestamp (2293 ms). If a face is detected in the captured images and the Arduino platform responds correctly to the commands, the camera tries to track the face.
If no face is detected, the variable noDetection is incremented. As noted, if the face is lost, the servo first tries to move the camera in the same direction; if there is still no detection, the camera sweeps to the other side. This movement is slower, about 1° per 40 ms. In this case, however, the ESP loses the information about the exact position, so a fixed position is set at certain points in time, which helps the ESP32 face recognition quickly re-establish the position. The variable noDetection is used as a timer, since it is incremented about every 200-240 ms. If a face is detected, none of the commands in the else branch are executed.
Conclusion
In the video, the robot looks like a "Big Brother", which doesn't sound sympathetic, since "Big Brother" stands for state surveillance and intrusion into the life of the individual. With appropriate make-up, though, our robot looks more likeable; in any case, that requires a bit of tinkering.
Unfortunately, the "Follow Me" feature is limited to face tracking, which is the only option in the ESP-WHO framework. For a real search, the OpenCV library would be more suitable; OpenCV.js runs in the browser, which allows rapid experimentation with OpenCV functions by anyone with a modest background in HTML and JavaScript.
This brings many advantages, but would be a completely different concept. That's why we leave that for the next projects.