Our project builds on a robot base from ME 461: a two-wheeled robot with a caster, controlled by a TI Launchpad F28379D microcontroller. The robot also has a 10-DOF IMU, a microphone, and two wheel encoders. For our project, we added a Raspberry Pi 4 to perform more intensive computations and a USB webcam to give the robot perception capabilities.
The robot runs off a 12V 3S rechargeable lithium battery, which is regulated down to 3.3V for the microcontroller and 5.25V for the Raspberry Pi. The Raspberry Pi is jumpered to the Launchpad for serial communication and talks to the webcam over USB.
Dead Reckoning
The robot fuses its encoder and gyro data using a variation on a complementary filter to get good dead reckoning. Based on observations of the robot, we reasoned that while gyro integration gave very steady, smooth bearing readings that were immune to wheel slippage, in the long run the wheel encoders were the only accurate way to tell whether the robot was stationary. Based on this, we adopted the heuristic that if the angular velocities measured by the encoders and the gyro are both low, the robot must not be turning. The robot then uses the encoders' measured angular velocity as ground truth to correct for gyro bias, weighted according to this heuristic: if the robot is not turning, the filter aggressively corrects the gyro bias toward zero; otherwise, it trusts the gyro readings.
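A minimal sketch of this bias-correction heuristic is shown below. The gains, thresholds, and loop structure are illustrative placeholders, not the values used in the Launchpad firmware:

```python
import math

def update_heading(heading, gyro_rate, enc_rate, gyro_bias, dt,
                   still_thresh=0.05, fast_gain=5.0, slow_gain=0.05):
    """One step of the heading/bias filter (illustrative constants only).

    heading    -- current heading estimate (rad)
    gyro_rate  -- raw angular velocity from the gyro (rad/s)
    enc_rate   -- angular velocity derived from the wheel encoders (rad/s)
    gyro_bias  -- current estimate of the gyro bias (rad/s)
    """
    # Heuristic: if both sensors report (almost) no rotation, assume the robot
    # is not turning, so the bias estimate chases the gyro reading aggressively;
    # otherwise it is only nudged very slowly.
    not_turning = (abs(enc_rate) < still_thresh and
                   abs(gyro_rate - gyro_bias) < still_thresh)
    gain = fast_gain if not_turning else slow_gain

    # Use the encoder rate as the "ground truth" to pull the bias toward.
    bias_error = (gyro_rate - enc_rate) - gyro_bias
    gyro_bias += gain * bias_error * dt

    # Integrate the bias-corrected gyro rate to get the heading.
    heading += (gyro_rate - gyro_bias) * dt
    heading = math.atan2(math.sin(heading), math.cos(heading))  # wrap to [-pi, pi]
    return heading, gyro_bias
```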
The Launchpad consistently runs control and sensing code at 1 kHz, making it much better suited for real-time control than the Pi. We take advantage of its processing power to filter all the sensor inputs, most notably the wheel encoder readings. The encoders are fairly coarse, at only 600 counts per revolution; to get a smooth velocity signal, we apply a low-pass filter to the wheel positions, then a derivative filter to compute raw velocities, and finally another low-pass filter to get smoothed velocities. The total delay introduced by all this filtering is only on the order of 4 ms thanks to the high sample rate.
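The filter chain itself runs in C on the Launchpad, but the structure is simple enough to sketch. The filter coefficients below are hypothetical placeholders, not the robot's tuned values:

```python
class VelocityFilter:
    """Low-pass -> finite difference -> low-pass, run at a fixed sample rate."""

    def __init__(self, dt=0.001, alpha_pos=0.4, alpha_vel=0.2):
        self.dt = dt                  # 1 kHz sample period
        self.alpha_pos = alpha_pos    # position low-pass coefficient (illustrative)
        self.alpha_vel = alpha_vel    # velocity low-pass coefficient (illustrative)
        self.pos_filt = None
        self.prev_pos_filt = None
        self.vel_filt = 0.0

    def update(self, raw_pos):
        # First low-pass: smooth the coarse (600 counts/rev) encoder position.
        if self.pos_filt is None:
            self.pos_filt = self.prev_pos_filt = raw_pos
        else:
            self.prev_pos_filt = self.pos_filt
            self.pos_filt += self.alpha_pos * (raw_pos - self.pos_filt)

        # Derivative filter: finite difference of the smoothed position.
        raw_vel = (self.pos_filt - self.prev_pos_filt) / self.dt

        # Second low-pass: smooth the resulting velocity estimate.
        self.vel_filt += self.alpha_vel * (raw_vel - self.vel_filt)
        return self.vel_filt
```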
Launchpad communication and ControlThe Launchpad publishes the position and velocity information of the robot through serial communication to the Pi at a rate of <X> Hz. This information contains a “index” stamp which counts up with every packet sent, the data itself (x, y, heading, velocity, and angular velocity), and is newline-terminated. The “index” and newline separators allow the Pi to determine if it has lost synchronization with the Launchpad, which triggers it to go into “newline-seeking” mode, where it tries to re-align itself with the Launchpad’s data stream by searching for newlines in the incoming data.
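A sketch of the Pi-side reader, assuming a comma-separated field layout (the real firmware's exact field order and serial settings may differ):

```python
import serial  # pyserial

def read_packets(port="/dev/ttyUSB0", baud=115200):
    """Yield (index, x, y, heading, vel, omega) tuples from the Launchpad.

    If a packet fails to parse, we effectively fall back to "newline-seeking":
    bytes are discarded until the next newline and parsing is retried.
    """
    ser = serial.Serial(port, baud, timeout=1.0)
    last_index = None
    while True:
        line = ser.readline()          # reads up to and including b"\n"
        try:
            fields = line.decode("ascii").strip().split(",")
            index, x, y, heading, vel, omega = (float(f) for f in fields)
        except (UnicodeDecodeError, ValueError):
            # Garbled or partial packet: readline() already resynchronizes us
            # on the next newline, so just skip this one.
            continue
        if last_index is not None and index != last_index + 1:
            pass                        # packets were dropped; could log here
        last_index = index
        yield index, x, y, heading, vel, omega
```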
The Pi can send two commands to the Launchpad: a commanded forward velocity and a commanded turn velocity. This packet is also newline-separated and is not pushed continuously; the Launchpad can receive packets relatively slowly. A PI controller on the Launchpad tries to maintain the target velocities it is given. Commands sent to the Launchpad automatically expire after 0.25 s, so a program crash on the Pi will not cause the robot to go out of control.
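The expiry logic lives in the Launchpad firmware, but the idea is simple enough to sketch; the 0.25 s timeout comes from the text above, everything else is illustrative:

```python
import time

class CommandWatchdog:
    """Zero the velocity targets if no fresh command arrives within 0.25 s."""

    TIMEOUT = 0.25  # seconds

    def __init__(self):
        self.target_fwd = 0.0
        self.target_turn = 0.0
        self.last_cmd_time = None

    def on_command(self, fwd, turn):
        self.target_fwd, self.target_turn = fwd, turn
        self.last_cmd_time = time.monotonic()

    def targets(self):
        # Called from the control loop before running the PI controller.
        if (self.last_cmd_time is None or
                time.monotonic() - self.last_cmd_time > self.TIMEOUT):
            return 0.0, 0.0   # command expired: stop the robot
        return self.target_fwd, self.target_turn
```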
Internally, the Launchpad integrates the (linearized) equations of motion of the robot to calculate its position and heading. These are never corrected by the SLAM algorithm; instead, the SLAM algorithm stores an "offset" transform that records how far the dead reckoning has drifted from what it believes is the ground truth, and uses this to recover the correct displacement vectors.
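The correction amounts to composing a 2D rigid transform with the dead-reckoned pose. A minimal sketch of that composition (not the robot's actual code):

```python
import math

def apply_offset(offset, pose):
    """Compose an SE(2) "offset" transform with a dead-reckoned pose.

    offset and pose are (x, y, theta) tuples; the result is the corrected
    pose in the map frame.
    """
    ox, oy, oth = offset
    px, py, pth = pose
    c, s = math.cos(oth), math.sin(oth)
    return (ox + c * px - s * py,
            oy + s * px + c * py,
            oth + pth)
```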
Camera Calibration
One of the first things we did when setting up the project was to calibrate the webcam's intrinsics and distortion parameters. Since our SLAM method relies on geometric relationships between pixels in the image, it was critical that the algorithm have clean, undistorted images as input. Calibration was done with a large calibration grid, and the intrinsics were recovered with OpenCV.
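For reference, a condensed version of the standard OpenCV calibration flow; the grid dimensions and image paths are placeholders and assume a checkerboard-style target, which may not match our exact board:

```python
import glob
import cv2
import numpy as np

# Inner-corner count of the calibration grid -- a placeholder value.
GRID = (9, 6)

# Template of 3D grid points (z = 0 plane), in "grid square" units.
objp = np.zeros((GRID[0] * GRID[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:GRID[0], 0:GRID[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.jpg"):      # placeholder path
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, GRID)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Recover the intrinsic matrix K and distortion coefficients.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

# Later, each incoming frame can be undistorted before the SLAM front-end:
# undistorted = cv2.undistort(frame, K, dist)
```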
This process gives us more geometrically accurate images, but slightly reduces the effective field of view of the camera. One possible improvement would be to incorporate the distortion effects of the camera later in the SLAM pipeline, to maximize the information gained from the camera.
SLAM
The robot uses a two-part system to do basic 2D SLAM. It maintains a discretized 2D grid map of the world for motion planning, but also keeps a separate set of detected lines (ideally walls) to help localize itself. I did not have enough time to investigate single-camera stereo matching for odometry and 3D reconstruction, so I opted for a heuristic method: detecting straight-line edges, which are relatively common in real-world obstacles (and especially in our testing environment). This is accomplished by using the Canny edge detector to detect edges, taking the bottommost white pixel in each column of the image, and running a randomized Hough transform over those points to get a set of lines. Using the calibrated camera matrix, those points are projected onto the ground plane (we assume all edges are formed by the intersection of something with the ground), which is then used to construct a 2D map of the environment.
Based on our classroom environment, we found that an effective heuristic for distinguishing the "walls" from the "floor" was to perform the edge detection on the saturation channel of the input image after converting it to HSV: this did a better job of ignoring markings and other patterning on the floor.
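A sketch of this front-end under simplifying assumptions: the camera mounting (rotation R_cam and height) is treated as known, the camera sits at the robot origin, and the Hough line-fitting step is omitted. The Canny thresholds are illustrative:

```python
import cv2
import numpy as np

def detect_floor_boundary(frame_bgr, K, R_cam, cam_height):
    """Find wall/floor boundary points and project them onto the ground plane.

    K is the calibrated camera matrix; R_cam (camera-to-robot rotation) and
    cam_height describe an assumed camera mounting.
    """
    # Edge detection on the saturation channel (ignores floor markings better).
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    edges = cv2.Canny(hsv[:, :, 1], 50, 150)

    # Bottommost edge pixel in each column = closest obstacle boundary.
    h, w = edges.shape
    boundary_px = []
    for u in range(w):
        rows = np.flatnonzero(edges[:, u])
        if rows.size:
            boundary_px.append((u, rows.max()))

    # Project each boundary pixel onto the ground plane by intersecting its
    # viewing ray with z = 0 (camera assumed at height cam_height).
    K_inv = np.linalg.inv(K)
    ground_pts = []
    for u, v in boundary_px:
        ray = R_cam @ K_inv @ np.array([u, v, 1.0])   # ray in the robot frame
        if ray[2] < -1e-6:                            # must point downward
            t = cam_height / -ray[2]
            ground_pts.append((t * ray[0], t * ray[1]))
    return np.array(ground_pts)

# The randomized Hough transform described above is then run over these
# boundary points to extract candidate wall lines.
```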
A clustering step is used to reduce the amount of data passed to the next stages. The Hough transform tends to output ~20 lines, which isn't much, but would still overload the iterative, O(n^3) algorithms used downstream (which I didn't have time to optimize). The DBSCAN algorithm is used to group similar lines. I couldn't find literature establishing a suitable distance metric for lines that accounts for both angular and translational misalignment, so I came up with the following:
where min_dist represents the minimum Euclidean distance between the two lines, computed using point-line checks. I chose this metric because I wanted lines that were close in both angle and endpoint positions to be merged, while leaving other cases unaffected. One potential issue is that it isn't actually a distance metric (it fails the triangle inequality).
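The exact expression for the metric is not reproduced here, so the helper below uses a placeholder combination of min_dist and angular misalignment; the part being illustrated is the DBSCAN plumbing with a precomputed pairwise distance matrix (scikit-learn's `metric="precomputed"`), and the eps/weight values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def segment_min_dist(a, b):
    """Minimum Euclidean distance between two segments via point-segment checks."""
    def pt_seg(p, s0, s1):
        d = s1 - s0
        t = np.clip(np.dot(p - s0, d) / max(np.dot(d, d), 1e-12), 0.0, 1.0)
        return np.linalg.norm(p - (s0 + t * d))
    return min(pt_seg(a[0], b[0], b[1]), pt_seg(a[1], b[0], b[1]),
               pt_seg(b[0], a[0], a[1]), pt_seg(b[1], a[0], a[1]))

def line_distance(a, b, angle_weight=1.0):
    """Placeholder stand-in for the custom metric described above: it grows
    with both angular misalignment and spatial separation, but the exact
    functional form used on the robot is not reproduced here."""
    da, db = a[1] - a[0], b[1] - b[0]
    ang = np.arccos(np.clip(abs(np.dot(da, db)) /
                            (np.linalg.norm(da) * np.linalg.norm(db)), 0.0, 1.0))
    return segment_min_dist(a, b) + angle_weight * ang

def cluster_lines(segments, eps=0.3):
    """segments: list of 2x2 arrays [[x0, y0], [x1, y1]], in metres."""
    n = len(segments)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = line_distance(segments[i], segments[j])
    return DBSCAN(eps=eps, min_samples=1, metric="precomputed").fit_predict(D)
```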
Even after merging, the line data is quite noisy due to the nature of the edge detection heuristic. Before being integrated into the global map, lines are kept for a few loop iterations to see whether they persist, which filters out the noisiest signals.
Every few loop iterations, the accumulated, partially filtered line segments are aligned to the global map using a brute-force, "tiered" search. The loss function is based on the Hausdorff distance, but capped at a certain distance per point, so that new lines disjoint from the map can be added without excessive penalty.
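One plausible reading of this loss, sketched below: each candidate point pays its distance to the nearest map point, capped so that genuinely new geometry adds only a bounded penalty, and the "tiered" search refines a coarse translation grid with a finer one (rotation omitted; the cap, grid steps, and structure are assumptions, not the project's exact implementation):

```python
import numpy as np

def capped_alignment_loss(candidate_pts, map_pts, cap=0.25):
    """Sum of per-point nearest-neighbour distances to the map, capped at `cap` m."""
    # Pairwise distances: shape (n_candidate, n_map).
    d = np.linalg.norm(candidate_pts[:, None, :] - map_pts[None, :, :], axis=2)
    nearest = d.min(axis=1)
    return np.minimum(nearest, cap).sum()

def tiered_search(candidate_pts, map_pts, pose0, coarse=0.10, fine=0.02, steps=3):
    """Brute-force (dx, dy) search: coarse grid first, then a finer grid
    around the best coarse cell."""
    best = pose0
    for step in (coarse, fine):
        offsets = np.arange(-steps, steps + 1) * step
        scores = {}
        for dx in offsets:
            for dy in offsets:
                shifted = candidate_pts + np.array([best[0] + dx, best[1] + dy])
                scores[(dx, dy)] = capped_alignment_loss(shifted, map_pts)
        dx, dy = min(scores, key=scores.get)
        best = (best[0] + dx, best[1] + dy)
    return best
```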
To integrate the (sparse) set of lines with the dense world map / occupancy grid, a visibility mask is calculated from the walls the robot can see. The robot does some basic occlusion calculations (in 2D only, so conceptually simple, though implemented inefficiently in O(n^2) time) to figure out which cells it can see given the walls, and conservatively assumes that gaps in the walls are filled by more wall. This isn't a problem: if the robot wants to explore, it can move closer to a gap and see that there is space beyond it.
The mask is coarsely divided into confidence levels depending on how far each cell is from the robot, and is then used to update the occupancy grid by removing "wall" tiles (the grid starts out completely populated with wall, except for a region around the robot that it assumes is safe).
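A sketch of the visibility check and grid update, assuming a cell-by-cell line-of-sight test against the wall segments (the distance tiers and confidence values are placeholders, not the project's actual numbers):

```python
import numpy as np

def segments_intersect(p1, p2, q1, q2):
    """True if segment p1-p2 strictly crosses segment q1-q2 (orientation test)."""
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def update_grid(grid, robot_xy, walls, resolution, near=1.0, far=2.5):
    """Clear cells the robot can currently see, with distance-based confidence.

    grid: 2D float array, 1.0 = wall, 0.0 = known free (starts all 1.0).
    walls: list of ((x0, y0), (x1, y1)) wall segments from the line map.
    """
    h, w = grid.shape
    for i in range(h):
        for j in range(w):
            cell = (j * resolution, i * resolution)
            dist = np.hypot(cell[0] - robot_xy[0], cell[1] - robot_xy[1])
            if dist > far:
                continue                       # outside the sensing range
            # Occlusion check: the cell is visible only if the line of sight
            # from the robot crosses no wall segment (O(cells * walls)).
            if any(segments_intersect(robot_xy, cell, a, b) for a, b in walls):
                continue
            grid[i, j] = min(grid[i, j], 0.0 if dist < near else 0.3)
    return grid
```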
Our algorithm still struggles somewhat with loop closure. Because it does not save the transforms of individual "line clouds", instead registering each observation against all previous observations at once and then adding it to the observation set, it does poorly when asked to make the large corrections that better loop closure would require. We partially get around this by having good dead reckoning, and by frequently performing "spins" to capture features and align the robot against previous "line clouds", but moving for long periods of time still tends to make the robot's map warp and shift.
During development, the algorithm was first implemented in Python for ease of iteration and tested against recorded sequences of images and poses. Python gave an unacceptable frame rate, so the entire algorithm was ported to C++ for real-time performance. The final version runs at about 15-17 Hz on the Raspberry Pi 4.
Motion Planning
Motion planning for the robot is split into three layers. At the lowest level, the robot follows a set of waypoints using a pure pursuit controller implemented in Python, which sends velocity commands to the Launchpad. It runs at a relatively low velocity to account for the low frame rate of the SLAM algorithm and to reduce wheel slippage.
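A minimal pure pursuit step, for reference; the lookahead distance and the (deliberately low) forward velocity are illustrative, and the waypoint selection is simplified compared to a production controller:

```python
import math

def pure_pursuit(pose, waypoints, lookahead=0.3, v_cmd=0.15):
    """One step of a minimal pure pursuit controller.

    pose: (x, y, theta) of the robot; waypoints: list of (x, y) points.
    Returns (forward_velocity, turn_rate).
    """
    x, y, th = pose

    # Pick the first waypoint at least `lookahead` away; default to the last.
    target = waypoints[-1]
    for wx, wy in waypoints:
        if math.hypot(wx - x, wy - y) >= lookahead:
            target = (wx, wy)
            break

    # Transform the target into the robot frame.
    dx, dy = target[0] - x, target[1] - y
    lx = math.cos(-th) * dx - math.sin(-th) * dy
    ly = math.sin(-th) * dx + math.cos(-th) * dy

    # Pure pursuit curvature: kappa = 2 * y_local / L^2.
    L2 = lx * lx + ly * ly
    kappa = 2.0 * ly / L2 if L2 > 1e-6 else 0.0
    return v_cmd, v_cmd * kappa   # turn rate = forward velocity * curvature
```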
At the next level up, the robot uses a Fast Marching Method (FMM) planner, implemented by the Klampt library, to execute "go-to-waypoint" commands. The planner uses a generous clearance around the robot, both to account for the tracking error introduced by the pure pursuit controller and because it plans only in X and Y, ignoring how the robot's collision box changes with heading. A TODO for this project is to extend the planner to the robot's true configuration space (x, y, θ) so the clearance can be reduced; right now the planner is slightly too conservative.
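The project uses Klampt's planner; the sketch below shows the same idea with scikit-fmm as a stand-in: obstacles are inflated by a clearance margin, an arrival-time field is marched outward from the goal, and the path greedily descends that field. The clearance value and grid conventions are assumptions:

```python
import numpy as np
import skfmm                              # scikit-fmm, stand-in for Klampt's FMM
from scipy.ndimage import binary_dilation

def plan_fmm(occupancy, start, goal, resolution, clearance_cells=4):
    """Plan a grid path with the Fast Marching Method.

    occupancy: 2D bool array, True = wall.  start/goal: (row, col) indices.
    clearance_cells inflates obstacles, mirroring the generous clearance
    described above.
    """
    # Inflate obstacles so the planner keeps its distance from walls.
    inflated = binary_dilation(occupancy, iterations=clearance_cells)

    # Signed field that crosses zero at the goal; mask out obstacle cells.
    phi = np.ones(occupancy.shape)
    phi[goal] = -1.0
    phi = np.ma.MaskedArray(phi, mask=inflated)

    # Arrival time from the goal to every free cell (unit speed everywhere).
    tt = skfmm.travel_time(phi, np.ones(occupancy.shape), dx=resolution)

    # Greedy descent of the arrival-time field from start toward the goal.
    path, cur = [start], start
    while cur != goal and len(path) < occupancy.size:
        r, c = cur
        neighbours = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                      if (dr or dc)
                      and 0 <= r + dr < occupancy.shape[0]
                      and 0 <= c + dc < occupancy.shape[1]
                      and not inflated[r + dr, c + dc]]
        cur = min(neighbours, key=lambda rc: tt[rc])
        path.append(cur)
    return path
```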
The highest-level planner is a random walk controller, which randomly samples feasible points of the configuration space and sends them to the "go-to-waypoint" controller. It samples points within a given radius around the robot, starting from the larger radii, to encourage exploration without losing position tracking. This planner can also be overridden for manual control of the robot (commanded velocity) or manual waypoint setting.
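A small sketch of that sampling strategy; the radii, try counts, and the `occupancy_free` feasibility callback are hypothetical:

```python
import math
import random

def sample_exploration_goal(pose, occupancy_free, radii=(2.0, 1.0, 0.5), tries=50):
    """Sample a feasible goal near the robot, preferring the largest radius.

    occupancy_free(x, y) -> bool is assumed to check the occupancy grid
    (with clearance) for feasibility.
    """
    x, y, _ = pose
    for r in radii:                       # larger radii first: favour exploration
        for _ in range(tries):
            ang = random.uniform(0.0, 2.0 * math.pi)
            d = random.uniform(0.5 * r, r)
            gx, gy = x + d * math.cos(ang), y + d * math.sin(ang)
            if occupancy_free(gx, gy):
                return gx, gy
    return None                           # nothing feasible found nearby
```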
Object Detection and Pronunciation
Our robot can recognize up to 80 common objects and pronounce the name of an object as soon as it is detected. The object detection and pronunciation features depend on the monocular camera and a speaker controlled by the Raspberry Pi. The pronunciation part is straightforward: the espeak package is installed and fed the name output by the object detector. Object detection is done with the YOLO algorithm, a state-of-the-art, real-time object detection system; the trained weights and configuration files were downloaded from https://pjreddie.com/darknet/yolo/, which also explains the algorithm. For this project, we used YOLOv3-tiny for faster frame rates. Objects seen by the camera are evaluated by YOLO, which produces confidence scores for all of the trained classes. If the highest confidence is above the threshold (set to 0.5 for this project), the object is recognized as the corresponding class, and a bounding box roughly predicting its border is drawn with the name and confidence score displayed.
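One way to run the downloaded weights is through OpenCV's DNN module, sketched below; our actual inference code may be organized differently, and the file paths are placeholders:

```python
import cv2
import numpy as np

# Weights/config from https://pjreddie.com/darknet/yolo/; class names from the
# accompanying coco.names file (80 classes). Paths are placeholders.
net = cv2.dnn.readNetFromDarknet("yolov3-tiny.cfg", "yolov3-tiny.weights")
classes = open("coco.names").read().strip().split("\n")
CONF_THRESHOLD = 0.5

def detect(frame):
    """Return (class_name, confidence, box) detections above the threshold."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    h, w = frame.shape[:2]
    detections = []
    for output in outputs:
        for row in output:
            scores = row[5:]                      # per-class confidence scores
            class_id = int(np.argmax(scores))
            conf = float(scores[class_id])
            if conf > CONF_THRESHOLD:
                cx, cy, bw, bh = row[0:4] * np.array([w, h, w, h])
                box = (int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))
                detections.append((classes[class_id], conf, box))
    return detections
```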
In addition to YOLO, we also tried SSD (Single Shot MultiBox Detector) while looking for the most suitable deep learning model for this project. With a significantly larger configuration, SSD ran at a lower frame rate, so we chose YOLO for being lighter weight at runtime. This was especially important for our project because, although the Raspberry Pi 4 is fairly powerful, running the object detector and the real-time SLAM algorithm together is very resource-heavy; with our current settings it takes up nearly all of the processing power available on the Pi, and that is after manually throttling the frame rate of the object detector.
Since the object detector runs on every frame, the pronunciation of detected objects could get annoying, so we added a variable that keeps track of the previously detected object and prevents the speaker from pronouncing the same object over and over again.
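The debounce is just a comparison against the last spoken name; a minimal sketch, assuming espeak is invoked as an external command (our code may call it differently):

```python
import subprocess

last_spoken = None

def announce(name):
    """Speak a detected object's name, but only when it changes."""
    global last_spoken
    if name != last_spoken:
        subprocess.Popen(["espeak", name])   # non-blocking: don't stall the loop
        last_spoken = name
```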
We also tried undistorting the input frames for better detection performance, but detection turned out to work just fine on distorted images. In the end we left undistortion out to reduce CPU usage, but the code for it is still in the file and can be re-enabled simply by uncommenting it.
System Architecture
The robot runs an HTTP server, which hosts its user interface and allows the various components to communicate with each other. The pieces of information that require the largest data transfers (raw camera images and SLAM maps) are stored in shared memory; lighter information such as the robot pose and commands is transferred over HTTP for ease of use and extensibility. This setup allows multiple processes to "share" communication with the Launchpad and webcam; by default, only one process at a time can talk to a serial port or USB camera.
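A rough sketch of this split, using Python's named shared memory for the heavy frame data and Flask purely for illustration of the HTTP side (the actual server framework, endpoint names, buffer name, and frame shape are assumptions):

```python
import json
import numpy as np
from multiprocessing import shared_memory
from flask import Flask, request   # Flask used purely for illustration

# Heavy data (the latest camera frame) lives in a named shared-memory block
# that any process on the robot can attach to; name/shape are placeholders.
FRAME_SHAPE, FRAME_DTYPE = (480, 640, 3), np.uint8
shm = shared_memory.SharedMemory(name="camera_frame", create=True,
                                 size=int(np.prod(FRAME_SHAPE)))
frame_buf = np.ndarray(FRAME_SHAPE, dtype=FRAME_DTYPE, buffer=shm.buf)

# Light data (pose, commands) goes over HTTP.
app = Flask(__name__)
robot_pose = {"x": 0.0, "y": 0.0, "heading": 0.0}

@app.route("/pose", methods=["GET"])
def get_pose():
    return json.dumps(robot_pose)

@app.route("/command", methods=["POST"])
def set_command():
    cmd = request.get_json()          # e.g. {"forward": 0.1, "turn": 0.0}
    # ...forward to the process that owns the serial port here...
    return "ok"

def publish_frame(frame):
    """Called by the single process that owns the USB camera."""
    frame_buf[:] = frame              # other processes attach by name and read

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```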