Problem Statement: The main objective is for the robot to learn to avoid obstacles within “N” episodes and converge on the optimal action. In this case, let's assume we want our robot to learn that the optimal action is 'RIGHT'.
Reinforcement Learning Algorithm Used: Q-learning
How the L298N Drives Two DC Motors: To understand this, check out the tutorial at this link: https://howtomechatronics.com/tutorials/arduino/arduino-dc-motor-control-tutorial-l298n-pwm-h-bridge/
Before getting into the working of the project, it is important to understand how the ultrasonic sensor works. The basic principle behind the ultrasonic sensor is as follows:
Using an external trigger signal, the Trig pin of the ultrasonic sensor is held logic high for at least 10 µs. The transmitter module then sends out a sonic burst consisting of 8 pulses at 40 kHz.
The burst returns after hitting a surface and the receiver detects the reflected signal. The Echo pin stays high from the moment the burst is sent until the echo is received; this duration can be converted to distance with a simple calculation.
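To make this concrete, here is a minimal, illustrative way of reading the distance from an HC-SR04 style sensor (the pin numbers TRIG_PIN = 9 and ECHO_PIN = 10 are just assumptions for this example, not the project's actual wiring):
// Minimal HC-SR04 distance reading (illustrative; TRIG_PIN and ECHO_PIN are assumed pin numbers)
const int TRIG_PIN = 9;                      // configured as OUTPUT in setup()
const int ECHO_PIN = 10;                     // configured as INPUT in setup()

float readDistanceCM()
{
  digitalWrite(TRIG_PIN, LOW);
  delayMicroseconds(2);
  digitalWrite(TRIG_PIN, HIGH);              // hold Trig high for at least 10 microseconds
  delayMicroseconds(10);
  digitalWrite(TRIG_PIN, LOW);
  long duration = pulseIn(ECHO_PIN, HIGH);   // time (in microseconds) that Echo stays high
  return duration * 0.034 / 2.0;             // sound travels ~0.034 cm/us; halve for the round trip
}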
The aim of this project is to implement an obstacle avoiding robot using an ultrasonic sensor and Arduino. All the connections are made as per the circuit diagram. The working of the project is explained below.
When the robot is powered on, both motors run normally and the robot moves forward. During this time, the ultrasonic sensor continuously measures the distance between the robot and the surface in front of it.
This information is processed by the Arduino. If the distance between the robot and the obstacle is less than 15 cm, the robot stops and, using the servo motor and the ultrasonic sensor, scans the left and right directions for new distances. If the distance on the left side is greater than that on the right, the robot prepares for a left turn: it first backs up a little and then turns by driving the left wheel motor in the reverse direction.
Similarly, if the distance on the right is greater than that on the left, the robot prepares for a right turn. This process continues indefinitely and the robot keeps moving without hitting any obstacle. A simplified sketch of this scan-and-turn logic is shown below.
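Purely to illustrate the scan-and-turn behaviour described above, a simplified routine could look like the following. It reuses the readDistanceCM() helper from the sketch above; the motor helpers (moveForward(), moveBackward(), stopMotors(), turnLeft(), turnRight()) and the servo angles are placeholders, not the project's actual functions:
#include <Servo.h>

Servo scanServo;                             // servo carrying the ultrasonic sensor; attach() it in setup()
const float OBSTACLE_THRESHOLD_CM = 15.0;    // stop and scan when closer than this

// Motor-control helpers assumed to be implemented elsewhere (placeholders for this sketch)
void moveForward();
void moveBackward();
void stopMotors();
void turnLeft();
void turnRight();

void avoidObstacle()
{
  if (readDistanceCM() >= OBSTACLE_THRESHOLD_CM)
  {
    moveForward();                           // no obstacle nearby: keep driving forward
    return;
  }
  stopMotors();                              // obstacle closer than 15 cm: stop and scan
  scanServo.write(180);                      // look to the left
  delay(500);
  float leftDistance = readDistanceCM();
  scanServo.write(0);                        // look to the right
  delay(500);
  float rightDistance = readDistanceCM();
  scanServo.write(90);                       // face forward again
  moveBackward();                            // back up a little before turning
  delay(300);
  stopMotors();
  if (leftDistance > rightDistance)
  {
    turnLeft();                              // more room on the left
  }
  else
  {
    turnRight();                             // more room on the right
  }
}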
Important Terms in Reinforcement Learning:
1. STATE: This is the situation the robot is in. For a basic obstacle-avoiding robot there are only 2 states: the 1st state is when there is no obstacle close to it, and the 2nd state is when there is an obstacle in front of it. (When I wrote the code I assumed 10 different states, each expecting the same action. I did this to illustrate a more complex environment.)
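Just to illustrate, a hypothetical helper that maps the measured distance to a state index could look like this (the function name GET_STATE and the reuse of the 15 cm threshold are assumptions for this example; the actual code simply assumes 10 states that all expect the same action):
// ILLUSTRATION ONLY: map the measured distance to a state index
int GET_STATE(float DISTANCE_CM)
{
  if (DISTANCE_CM >= 15.0)
  {
    return 0;                                // 1st state: no obstacle close to the robot
  }
  return 1;                                  // 2nd state: obstacle in front of the robot
}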
2. ACTION: In a particular state the robot performs a particular action. There are 4 actions the robot can perform in the 2nd state: “FORWARD”, “BACKWARD”, “STOP” and “RIGHT”. In the 1st state the robot could in principle perform the same 4 actions, but to make things easier I have assumed that it performs only one action, “FORWARD” (it makes little sense to consider actions such as RIGHT or BACKWARD when there are no obstacles nearby).
int ACTIONS[4] = {0, 1, 2, 3};
/* HERE:
0 = FORWARD
1 = BACKWARD
2 = STOP
3 = RIGHT */
3. NEXT STATE: This is the state the robot ends up in after performing a particular “ACTION” in its current “STATE”. For the obstacle-avoiding robot, the NEXT STATE can be either a “CRASHED” state or a “SURVIVED” state. (Here the SURVIVED state is the same as the starting state the robot is in when its episode starts.)
/* AFTER PERFORMING AN ACTION THE ROBOT GOES INTO THE NEXT STATE, IN THIS CASE OF THE
OBSTACLE AVOIDING ROBOT */
int NEXT_STATE;
int STATE = 0;
NEXT_STATE = STATE + 1;
4. Q TABLE / Q MATRIX: This table is shaped by the number of “STATES” and the number of “ACTIONS”. In the obstacle-avoiding robot's case, this table is given by:
float Q[N_STATES][N_ACTIONS] = {{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0}};
Here N_STATES = 10 and N_ACTIONS = 4. A value of "0.0" indicates that any of the 4 possible actions can be performed in that state. If you want to eliminate a particular action in a state, just replace "0.0" with "-1.0" in the matrix; "-1.0" indicates that the action cannot be performed in that state. Here it is assumed that we have 10 different states, with each state expecting the same action. If you want your robot to learn different actions in different states, change the rewards in the reward matrix in the code.
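For example, if you wanted to forbid the BACKWARD action (column index 1) in state 0, you could mark that entry as -1.0 (shown here purely as an illustration):
// ILLUSTRATION: mark BACKWARD (action index 1) as not allowed in state 0,
// e.g. inside setup(), before learning starts
Q[0][1] = -1.0;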
5. TERMINAL STATE: This is the last state the robot can be in. For the obstacle-avoiding robot this state doesn't exist: there is no terminal state, and we want to keep the robot learning forever.
6. REWARD MATRIX: This table or matrix is used to give rewards to the robot for certain actions. The reward is positive or negative depending upon the quality of the action.
int REWARDS[N_STATES][N_ACTIONS] = {{-10,-2,-1,10},   // columns: FORWARD, BACKWARD, STOP, RIGHT
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10}};
7. ENVIRONMENT: This can be thought of as the world the robot operates in. For example, we humans live on Earth, so Earth is our environment.
To understand reinforcement learning better, visit this link: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/
Or else visit this: https://en.wikipedia.org/wiki/Reinforcement_learning
Hyperparameters in Reinforcement Learning:
1. LEARNING RATE (ALPHA): The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of ALPHA = 1.0 is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as ALPHA = 0.1 for all scenarios.
float ALPHA = 0.2;
2. DISCOUNT FACTOR (GAMMA): The discount factor determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, while a factor approaching 1 will make it strive for long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For GAMMA = 1.0, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to the propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.
float GAMMA = 0.9;
3. EXPLORATION RATE (EPSILON): This parameter decides to what extent the robot should explore the environment. Exploring the environment means performing random actions and analyzing the results through the Q values. Usually in Q-learning (unlike SARSA) this parameter is gradually reduced as the robot learns more and more. In this project, however, we are not going to get rid of Epsilon, since we don't have any terminal state. Epsilon will decay to some extent and then be reset when it drops below a threshold value. This makes sure the robot keeps exploring throughout its lifetime, just like we humans do. A small sketch of this decay-and-reset scheme is shown after the declaration below.
float EPSILON = 0.75;
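The decay-and-reset idea described above could be sketched like this (the decay factor and the threshold are assumed values, not necessarily the ones used in the final code):
// Illustrative epsilon decay with reset (the constants are assumptions for this sketch)
const float EPSILON_DECAY = 0.98;      // shrink epsilon a little after every episode
const float EPSILON_MIN   = 0.10;      // threshold below which epsilon is reset
const float EPSILON_START = 0.75;

void updateEpsilon()
{
  EPSILON = EPSILON * EPSILON_DECAY;   // explore slightly less as learning progresses
  if (EPSILON < EPSILON_MIN)
  {
    EPSILON = EPSILON_START;           // reset so the robot never stops exploring
  }
}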
Q-LEARNING ALGORITHM:
- Initialize the Q-values table, Q(s, a). I have initialized these values to 0.0.
- Observe the current state, s.
- Choose an action, a, for that state based on an action-selection policy (ε-soft, ε-greedy or softmax).
////////////////////////// Epsilon Greedy Policy //////////////////////////////
PROB = random(0, 100) / 100.0;    // random probability between 0 and 1
if (PROB <= EPSILON)              // EXPLORE: try a random action
{
  ACTION = random(0, 4);          // pick one of the 4 actions at random
  FLAG = 2;
}
else                              // EXPLOIT: use the best action from the Q table
{
  ACTION = ARGMAX(Q, STATE);      // action with the highest Q value in the current state
  FLAG = 2;
}
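ARGMAX is a helper defined elsewhere in the full sketch; a minimal version consistent with the call above might look like this (assuming N_STATES and N_ACTIONS are declared as constants, e.g. #define N_STATES 10 and #define N_ACTIONS 4; the project's own implementation may differ):
// Return the index of the action with the highest Q value in the given state
int ARGMAX(float Q_TABLE[N_STATES][N_ACTIONS], int STATE)
{
  int BEST_ACTION = 0;
  for (int A = 1; A < N_ACTIONS; A++)
  {
    if (Q_TABLE[STATE][A] > Q_TABLE[STATE][BEST_ACTION])
    {
      BEST_ACTION = A;            // remember the action with the highest Q value so far
    }
  }
  return BEST_ACTION;
}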
- Take the action, and observe the reward, r, as well as the new state, s'.
- Update the Q-value for the state using the observed reward and the maximum reward possible for the next state. The updating is done according to the formula and parameters described above.
- Set the state to the new state, and repeat the process. (Normally you would repeat until a terminal state is reached, but as noted above this robot has no terminal state, so the loop simply continues.)
- To understand Q-learning better visit this link:https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c
/////////////////// Implementation of the Q-Learning Formula /////////////////////////
Q_OLD = Q_TABLE[S][A];            // current estimate Q(s, a)
Q_MAX = MAX(Q_TABLE, NEXT_S);     // highest Q value reachable from the next state s'
Q_NEW = (1-LEARNING_RATE)*Q_OLD + LEARNING_RATE*(R + DISCOUNT_FACTOR*Q_MAX);
Serial.print("Q VALUE : ");
Serial.println(Q_NEW);
Q_TABLE[S][A] = Q_NEW;            // write the updated value back into the Q table
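Like ARGMAX, MAX is a helper from the full sketch that returns the best Q value in a given state; a minimal version consistent with the call above might be (again, the actual implementation may differ):
// Return the highest Q value available in the given state
float MAX(float Q_TABLE[N_STATES][N_ACTIONS], int STATE)
{
  float BEST_VALUE = Q_TABLE[STATE][0];
  for (int A = 1; A < N_ACTIONS; A++)
  {
    if (Q_TABLE[STATE][A] > BEST_VALUE)
    {
      BEST_VALUE = Q_TABLE[STATE][A];
    }
  }
  return BEST_VALUE;
}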
Working Video: Don't forget to check out the working video of the AI robot :)