Follow the Leader
GUIDE, a human-guided AI framework, uses real-time feedback and a human-simulating algorithm to produce more robust autonomous systems.
In many ways, moving the field of machine learning forward is like playing a game of Whac-A-Mole. As one area advances to the point that it is ready to solve real-world problems, other areas that are still sorely lacking become more apparent. This situation is playing out today as advanced algorithms grow increasingly capable, yet we find that — massive as they may be — the available training datasets are often insufficient to produce robust and well-generalized models. As a result, we have autonomous robots that get confused the moment they encounter a situation that deviates from the distribution of their training data.
Human-guided reinforcement learning (RL) has been proposed to help fill the knowledge gaps left by traditional training methods. This approach relies on demonstrations performed by experts, which machines then learn to imitate. But once again, the technique requires very large datasets to succeed, and those demonstrations are time-consuming and expensive to collect. Furthermore, existing methods are only compatible with offline learning, which means autonomous systems cannot learn in the field, in real time.
Duke University recently teamed up with the Army Research Laboratory to develop a new human-guided RL framework named, very appropriately, GUIDE. The approach enables continuous, real-time feedback from humans to accelerate policy learning. During this guidance process, a parallel training algorithm also learns to simulate human feedback. In this way, the policy can continue to be trained — in a simulated environment — long after the human trainers have called it a day.
The system's design centers around an interactive feedback loop where human trainers provide real-time assessments of the agent’s actions using a novel interface. Instead of relying on the discrete feedback methods of previous approaches, such as clicking buttons to label an action as “good,” “bad,” or “neutral,” GUIDE enables trainers to hover a mouse cursor over a gradient scale to deliver feedback continuously. This method fosters natural engagement, allows for more expressive feedback, and ensures constant training by providing ongoing adjustments. Furthermore, GUIDE simplifies the challenge of associating delayed feedback with specific actions by assuming a consistent feedback delay, enabling smoother integration into the learning process.
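To make that feedback mechanism concrete, here is a minimal sketch of how a continuous cursor reading and a fixed feedback delay might be handled. The function names, the delay constant, and the [-1, 1] scaling are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of continuous feedback capture with a fixed delay;
# constants and names are assumptions for illustration, not the authors' code.
from collections import deque

FEEDBACK_DELAY = 3                 # assumed constant human reaction delay, in env steps
SCALE_MIN, SCALE_MAX = 0.0, 1.0    # normalized cursor position on the gradient scale

def cursor_to_feedback(cursor_pos: float) -> float:
    """Map a cursor position on the gradient scale to a feedback value in [-1, 1]."""
    pos = min(max(cursor_pos, SCALE_MIN), SCALE_MAX)
    return 2.0 * (pos - SCALE_MIN) / (SCALE_MAX - SCALE_MIN) - 1.0

# Buffer of recent (state, action) pairs so that delayed feedback can be
# credited to the step the human was actually reacting to.
recent_steps = deque(maxlen=FEEDBACK_DELAY + 1)

def record_step(state, action):
    """Store the latest state-action pair as the episode unfolds."""
    recent_steps.append((state, action))

def attribute_feedback(cursor_pos: float):
    """Return ((state, action), feedback) for the step FEEDBACK_DELAY steps ago."""
    if len(recent_steps) <= FEEDBACK_DELAY:
        return None  # not enough history yet to attribute feedback
    feedback = cursor_to_feedback(cursor_pos)
    return recent_steps[0], feedback
```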
GUIDE also combines human feedback with sparse environment rewards to shape the algorithm’s behavior effectively. While human feedback offers nuanced guidance, environment rewards provide broader objectives that reinforce desirable outcomes. By converting human feedback into dense rewards and seamlessly integrating them with environment rewards, GUIDE allows the use of advanced RL algorithms without significant modifications. This interactive reward-shaping approach is particularly useful for long-horizon tasks where predefined dense reward functions would require substantial manual effort and fail to adapt to unforeseen scenarios.
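The reward-shaping idea itself can be summarized in a few lines. The sketch below assumes a simple additive combination with a tunable weight; the exact formulation GUIDE uses is not spelled out here.

```python
# Minimal sketch of combining sparse environment rewards with dense human feedback;
# the additive scheme and the weight value are assumptions, not the paper's formula.
def shaped_reward(env_reward: float, human_feedback: float,
                  feedback_weight: float = 0.5) -> float:
    """Combine a sparse environment reward with dense human feedback.

    env_reward      -- reward emitted by the environment (often zero until task success)
    human_feedback  -- continuous feedback in [-1, 1] from the trainer (or its surrogate)
    feedback_weight -- assumed coefficient balancing the two signals
    """
    return env_reward + feedback_weight * human_feedback
```

Because the combined signal is just another scalar reward, it can be handed to a standard RL algorithm unchanged, which is what makes the approach compatible with existing methods.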
To reduce reliance on human input over time, GUIDE incorporates a regression model that learns to mimic human feedback. This model is trained concurrently during the human-guided phase by collecting state-action pairs and their corresponding feedback values. The resulting neural network acts as a surrogate human trainer, providing consistent feedback when human involvement is no longer feasible. By minimizing the difference between actual human feedback and its predictions, the model ensures that the learning process remains aligned with the original training objectives.
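As an illustration of what such a surrogate might look like, the sketch below uses a small PyTorch regression network trained with a mean-squared-error loss on recorded state-action-feedback triples. The architecture and training step are assumptions for illustration, not the authors' implementation.

```python
# Illustrative surrogate-feedback model; network sizes and training details
# are placeholders, not the design described in the GUIDE paper.
import torch
import torch.nn as nn

class FeedbackModel(nn.Module):
    """Regresses a scalar feedback value from a state-action pair."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),  # feedback assumed bounded in [-1, 1]
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def train_step(model, optimizer, states, actions, human_feedback):
    """One regression update: minimize MSE between predicted and human feedback."""
    pred = model(states, actions)
    loss = nn.functional.mse_loss(pred, human_feedback)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained, the model can be queried in place of the human, feeding its predictions through the same reward-shaping path described above.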
To assess GUIDE's performance, the researchers ran an experiment with a hide-and-seek computer game: a one-on-one scenario in which a seeker, controlled by the learning agent, had to navigate a maze to locate a hider that moved according to simple heuristic behaviors. Compared with other RL-based approaches, GUIDE achieved a 30 percent higher success rate.
The researchers’ initial work focused on relatively simple tasks. Moving forward, they intend to experiment with more complex scenarios in the hope of getting GUIDE ready for real-world use.