Problem Statement: The main objective is for the robot to learn to avoid obstacles within “N” episodes and converge on the optimal action. In this case, let's assume we want our robot to learn that the optimal action is 'RIGHT'.
Reinforcement Learning Algorithm Used: Q-learning
How the L298N Drives Two DC Motors: To understand this, check out the tutorial at this link: https://howtomechatronics.com/tutorials/arduino/arduino-dc-motor-control-tutorial-l298n-pwm-h-bridge/
Before getting into the working of the project, it is important to understand how the ultrasonic sensor works. The basic principle behind the ultrasonic sensor is as follows:
Using an external trigger signal, the Trig pin of the ultrasonic sensor is held logic high for at least 10 µs. The transmitter module then sends out a sonic burst consisting of 8 pulses at 40 kHz.
The burst returns after hitting a surface and the receiver detects the reflected signal. The Echo pin stays high from the moment the burst is sent until the echo is received; this duration can be converted to distance with a simple calculation.
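To make this concrete, here is a minimal, illustrative way of reading the distance from an HC-SR04 style sensor (the pin numbers TRIG_PIN = 9 and ECHO_PIN = 10 are just assumptions for this example, not the project's actual wiring):
// Minimal HC-SR04 distance reading (illustrative; TRIG_PIN and ECHO_PIN are assumed pin numbers)
const int TRIG_PIN = 9;                      // configured as OUTPUT in setup()
const int ECHO_PIN = 10;                     // configured as INPUT in setup()

float readDistanceCM()
{
  digitalWrite(TRIG_PIN, LOW);
  delayMicroseconds(2);
  digitalWrite(TRIG_PIN, HIGH);              // hold Trig high for at least 10 microseconds
  delayMicroseconds(10);
  digitalWrite(TRIG_PIN, LOW);
  long duration = pulseIn(ECHO_PIN, HIGH);   // time (in microseconds) that Echo stays high
  return duration * 0.034 / 2.0;             // sound travels ~0.034 cm/us; halve for the round trip
}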
The aim of this project is to implement an obstacle avoiding robot using an ultrasonic sensor and Arduino. All the connections are made as per the circuit diagram. The working of the project is explained below.
When the robot is powered on, both motors run normally and the robot moves forward. During this time, the ultrasonic sensor continuously measures the distance between the robot and the surface in front of it.
This information is processed by the Arduino. If the distance between the robot and the obstacle is less than 15 cm, the robot stops and, using the servo motor and the ultrasonic sensor, scans the left and right directions for new distances. If the distance on the left side is greater than that on the right, the robot prepares for a left turn: it first backs up a little and then turns by driving the left wheel motor in the reverse direction.
Similarly, if the distance on the right is greater than that on the left, the robot prepares for a right turn. This process continues indefinitely and the robot keeps moving without hitting any obstacle. A simplified sketch of this scan-and-turn logic is shown below.
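Purely to illustrate the scan-and-turn behaviour described above, a simplified routine could look like the following. It reuses the readDistanceCM() helper from the sketch above; the motor helpers (moveForward(), moveBackward(), stopMotors(), turnLeft(), turnRight()) and the servo angles are placeholders, not the project's actual functions:
#include <Servo.h>

Servo scanServo;                             // servo carrying the ultrasonic sensor; attach() it in setup()
const float OBSTACLE_THRESHOLD_CM = 15.0;    // stop and scan when closer than this

// Motor-control helpers assumed to be implemented elsewhere (placeholders for this sketch)
void moveForward();
void moveBackward();
void stopMotors();
void turnLeft();
void turnRight();

void avoidObstacle()
{
  if (readDistanceCM() >= OBSTACLE_THRESHOLD_CM)
  {
    moveForward();                           // no obstacle nearby: keep driving forward
    return;
  }
  stopMotors();                              // obstacle closer than 15 cm: stop and scan
  scanServo.write(180);                      // look to the left
  delay(500);
  float leftDistance = readDistanceCM();
  scanServo.write(0);                        // look to the right
  delay(500);
  float rightDistance = readDistanceCM();
  scanServo.write(90);                       // face forward again
  moveBackward();                            // back up a little before turning
  delay(300);
  stopMotors();
  if (leftDistance > rightDistance)
  {
    turnLeft();                              // more room on the left
  }
  else
  {
    turnRight();                             // more room on the right
  }
}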
Important Terms in Reinforcement Learning:
1. STATE: This is the situation the robot is in. For a basic obstacle-avoiding robot there are only 2 states: the 1st state is when there is no obstacle close to it, and the 2nd state is when there is an obstacle in front of it. (When I wrote the code I assumed 10 different states, each expecting the same action. I did this to illustrate a more complex environment.)
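Just to illustrate, a hypothetical helper that maps the measured distance to a state index could look like this (the function name GET_STATE and the reuse of the 15 cm threshold are assumptions for this example; the actual code simply assumes 10 states that all expect the same action):
// ILLUSTRATION ONLY: map the measured distance to a state index
int GET_STATE(float DISTANCE_CM)
{
  if (DISTANCE_CM >= 15.0)
  {
    return 0;                                // 1st state: no obstacle close to the robot
  }
  return 1;                                  // 2nd state: obstacle in front of the robot
}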
2. ACTION: In a particular state the robot performs a particular action. There are 4 actions the robot can perform in the 2nd state: “FORWARD”, “BACKWARD”, “STOP” and “RIGHT”. In the 1st state the robot could in principle perform the same 4 actions, but to make things easier I have assumed that it performs only one action, “FORWARD” (it makes little sense to consider actions such as RIGHT or BACKWARD when there are no obstacles nearby).
int ACTIONS[4] = {0, 1, 2, 3};
/* HERE:
0 = FORWARD
1 = BACKWARD
2 = STOP
3 = RIGHT */
3. NEXT STATE: This is the state the robot ends up in after performing a particular “ACTION” in its current “STATE”. For the obstacle-avoiding robot, the NEXT STATE can be either a “CRASHED” state or a “SURVIVED” state. (Here the SURVIVED state is the same as the starting state the robot is in when its episode starts.)
/* AFTER PERFORMING AN ACTION THE ROBOT GOES INTO THE NEXT STATE, IN THIS CASE OF THE
OBSTACLE AVOIDING ROBOT */
int NEXT_STATE;
int STATE = 0;
NEXT_STATE = STATE + 1;
4. Q TABLE / Q MATRIX: This table is shaped by the number of “STATES” and the number of “ACTIONS”. In the obstacle-avoiding robot's case, this table is given by:
float Q[N_STATES][N_ACTIONS] = {{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0}};
Here N_STATES = 10 and N_ACTIONS = 4. A value of "0.0" indicates that any of the 4 possible actions can be performed in that state. If you want to eliminate a particular action in a state, just replace "0.0" with "-1.0" in the matrix; "-1.0" indicates that the action cannot be performed in that state. Here it is assumed that we have 10 different states, with each state expecting the same action. If you want your robot to learn different actions in different states, change the rewards in the reward matrix in the code.
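For example, if you wanted to forbid the BACKWARD action (column index 1) in state 0, you could mark that entry as -1.0 (shown here purely as an illustration):
// ILLUSTRATION: mark BACKWARD (action index 1) as not allowed in state 0,
// e.g. inside setup(), before learning starts
Q[0][1] = -1.0;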
5. TERMINAL STATE: This is the last state the robot can be in. For the obstacle-avoiding robot this state doesn't exist: there is no terminal state, and we want to keep the robot learning forever.
6. REWARD MATRIX: This table or matrix is used to give rewards to the robot for certain actions. The reward is positive or negative depending upon the quality of the action.
int REWARDS[N_STATES][N_ACTIONS] = {{-10,-2,-1,10},   // columns: FORWARD, BACKWARD, STOP, RIGHT
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10}};
7. ENVIRONMENT: This can be thought of as the world the robot operates in. For example, we humans live on Earth, so Earth is our environment.
To understand reinforcement learning better, visit this link: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/
Or else visit this: https://en.wikipedia.org/wiki/Reinforcement_learning
Hyperparameters in Reinforcement Learning:
1. LEARNING RATE (ALPHA): The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of ALPHA = 1.0 is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as ALPHA = 0.1 for all scenarios.
float ALPHA = 0.2;
2. DISCOUNT FACTOR (GAMMA): The discount factor determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, while a factor approaching 1 will make it strive for long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For GAMMA = 1.0, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to the propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.
float GAMMA = 0.9;
3. EXPLORATION RATE (EPSILON): This parameter decides to what extent the robot should explore the environment. Exploring the environment means performing random actions and analyzing the results through the Q values. Usually in Q-learning (unlike SARSA) this parameter is gradually reduced as the robot learns more and more. In this project, however, we are not going to get rid of Epsilon, since we don't have any terminal state. Epsilon will decay to some extent and then be reset when it drops below a threshold value. This makes sure the robot keeps exploring throughout its lifetime, just like we humans do. A small sketch of this decay-and-reset scheme is shown after the declaration below.
float EPSILON = 0.75;
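The decay-and-reset idea described above could be sketched like this (the decay factor and the threshold are assumed values, not necessarily the ones used in the final code):
// Illustrative epsilon decay with reset (the constants are assumptions for this sketch)
const float EPSILON_DECAY = 0.98;      // shrink epsilon a little after every episode
const float EPSILON_MIN   = 0.10;      // threshold below which epsilon is reset
const float EPSILON_START = 0.75;

void updateEpsilon()
{
  EPSILON = EPSILON * EPSILON_DECAY;   // explore slightly less as learning progresses
  if (EPSILON < EPSILON_MIN)
  {
    EPSILON = EPSILON_START;           // reset so the robot never stops exploring
  }
}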
Q-LEARNING ALGORITHM:
- Initialize the Q-values table, Q(s, a). I have initialized these values to 0.0.
- Observe the current state, s.
- Choose an action, a, for that state based on an action-selection policy (ε-soft, ε-greedy or softmax).
////////////////////////// Epsilon Greedy Policy //////////////////////////////
PROB = random(0, 100) / 100.0;    // random probability between 0 and 1
if (PROB <= EPSILON)              // EXPLORE: try a random action
{
  ACTION = random(0, 4);          // pick one of the 4 actions at random
  FLAG = 2;
}
else                              // EXPLOIT: use the best action from the Q table
{
  ACTION = ARGMAX(Q, STATE);      // action with the highest Q value in the current state
  FLAG = 2;
}
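ARGMAX is a helper defined elsewhere in the full sketch; a minimal version consistent with the call above might look like this (assuming N_STATES and N_ACTIONS are declared as constants, e.g. #define N_STATES 10 and #define N_ACTIONS 4; the project's own implementation may differ):
// Return the index of the action with the highest Q value in the given state
int ARGMAX(float Q_TABLE[N_STATES][N_ACTIONS], int STATE)
{
  int BEST_ACTION = 0;
  for (int A = 1; A < N_ACTIONS; A++)
  {
    if (Q_TABLE[STATE][A] > Q_TABLE[STATE][BEST_ACTION])
    {
      BEST_ACTION = A;            // remember the action with the highest Q value so far
    }
  }
  return BEST_ACTION;
}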
- Take the action, and observe the reward, r, as well as the new state, s'.
- Update the Q-value for the state using the observed reward and the maximum reward possible for the next state. The updating is done according to the formula and parameters described above.
- Set the state to the new state, and repeat the process. (Normally you would repeat until a terminal state is reached, but as noted above this robot has no terminal state, so the loop simply continues.)
- To understand Q-learning better visit this link:https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c
/////////////////// Implementation of the Q-Learning Formula /////////////////////////
Q_OLD = Q_TABLE[S][A];            // current estimate Q(s, a)
Q_MAX = MAX(Q_TABLE, NEXT_S);     // highest Q value reachable from the next state s'
Q_NEW = (1-LEARNING_RATE)*Q_OLD + LEARNING_RATE*(R + DISCOUNT_FACTOR*Q_MAX);
Serial.print("Q VALUE : ");
Serial.println(Q_NEW);
Q_TABLE[S][A] = Q_NEW;            // write the updated value back into the Q table
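Like ARGMAX, MAX is a helper from the full sketch that returns the best Q value in a given state; a minimal version consistent with the call above might be (again, the actual implementation may differ):
// Return the highest Q value available in the given state
float MAX(float Q_TABLE[N_STATES][N_ACTIONS], int STATE)
{
  float BEST_VALUE = Q_TABLE[STATE][0];
  for (int A = 1; A < N_ACTIONS; A++)
  {
    if (Q_TABLE[STATE][A] > BEST_VALUE)
    {
      BEST_VALUE = Q_TABLE[STATE][A];
    }
  }
  return BEST_VALUE;
}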
Working Video: Don't forget to check out the working video of the AI robot :)