The future of gaming includes generative AI, whether for terrain generation, non-player character (NPC) AI, or voice generation. One area that benefits from AI is in-game dialogue and cut-scenes. Gaming studios often produce rendered, voice-acted cut-scenes that rival some movie productions, yet many developers don't add voices to their in-game dialogue. Older games can be revamped by adding voices to their cut-scenes.
What if developers don't know what dialogue will be spoken because it depends on player input? Games such as League of Legends have players take control of specific characters with distinct voices. These games could improve player immersion and accessibility by providing AI-generated voices for player chat.
VEGI (Voice-Enhanced Gaming Interface, pronounced 've-jē') uses different AI models to add voice to video games efficiently by offloading the AI workloads to AMD's NPU, freeing up the CPU/GPU for game-critical computation.
NPU
AMD's Ryzen AI CPUs provide the ability to offload AI models to their Neural Processing Unit (NPU), which frees up CPU cycles for more important tasks. One of the main use cases is gaming, where performance relies heavily on the CPU and GPU. By utilizing an NPU, VEGI can ensure that game performance doesn't drop drastically.
AMD provides a few different workflows: ONNX and PyTorch. Our model architectures for OCR, sentiment analysis, and TTS were better suited to PyTorch. The PyTorch workflow consists of quantizing and converting the models; specifically, Linear layers are converted to Ryzen AI-compatible QLinear layers. Layer conversion must happen at runtime, since there is currently no way to save/load converted models; however, the quantization step can be avoided by quantizing the model ahead of time and saving it to a file.
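For illustration, a minimal sketch of this flow is below. The NPU-compatible `QLinear` wrapper and its import path come from the Ryzen AI software and vary by version, so the replacement class is passed in as a parameter rather than assumed; only the standard PyTorch dynamic quantization call is taken as given.

```python
import torch
import torch.nn as nn

def quantize_and_save(model: nn.Module, path: str) -> None:
    """Dynamically quantize Linear layers once and cache the weights to disk,
    so the quantization step is not repeated on every launch."""
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    torch.save(quantized.state_dict(), path)

def swap_linear_layers(model: nn.Module, target: type, replacement_factory) -> nn.Module:
    """Recursively replace `target` layers with NPU-compatible equivalents.

    This has to run at runtime because converted models currently cannot be
    saved and reloaded. `replacement_factory` would wrap each layer in the
    Ryzen AI QLinear module (exact name/signature depends on your Ryzen AI
    software version).
    """
    for name, child in model.named_children():
        if isinstance(child, target):
            setattr(model, name, replacement_factory(child))
        else:
            swap_linear_layers(child, target, replacement_factory)
    return model

# Usage (Ryzen AI QLinear assumed to be importable in your environment):
# model = swap_linear_layers(model, torch.ao.nn.quantized.dynamic.Linear,
#                            lambda layer: QLinear(layer))
```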
Dialogue Extraction
VEGI offers two distinct methods of dialogue extraction from games: OCR and Overwolf. OCR (Optical Character Recognition) is a common AI technique for translating images of text into text data; older games such as Pokémon can benefit from OCR, as there is no way to intercept chat/text data from the game. For this dialogue extraction method, VEGI utilizes an existing OCR solution: EasyOCR. EasyOCR consists of an encoder and a decoder PyTorch model, both of which need to be converted to NPU-compatible models using the layer conversion method.
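As a rough illustration of this path (not VEGI's actual capture code), the sketch below grabs a screen region with PIL and passes it to the public EasyOCR API; the capture coordinates are placeholders for wherever the game draws its dialogue box.

```python
import numpy as np
import easyocr
from PIL import ImageGrab

# The Reader loads EasyOCR's detection and recognition PyTorch models;
# in VEGI these are the models whose Linear layers get converted for the NPU.
reader = easyocr.Reader(['en'], gpu=False)

def read_dialogue_box(bbox=(100, 800, 1180, 1000)) -> str:
    """Grab the on-screen dialogue region and return any recognized text.

    The bbox is a placeholder; the real region depends on the game and resolution.
    """
    frame = np.array(ImageGrab.grab(bbox=bbox))
    lines = reader.readtext(frame, detail=0)  # detail=0 -> plain strings
    return " ".join(lines)
```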
The other technique targets modern multiplayer games: Overwolf, a gaming overlay, allows users to extract game data from many supported online games. In our example, we use Overwolf to support our League of Legends Champion voices. Overwolf only supports a select few multiplayer games, but it extracts text directly, eliminating the need to run any AI models.
Dialogue Analysis
Sentiment is extracted from the text using a transformer model that was trained in PyTorch to detect 21 different emotions. The original idea was to take the sentiment produced by the analysis and pass it to the TTS model to create emotive voices; however, as this is still an emerging technology, this feature is considered future work.
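The classification step itself is straightforward; a minimal sketch using the Hugging Face pipeline API is below. VEGI's 21-emotion classifier was trained separately, so a public emotion checkpoint stands in for it here.

```python
from transformers import pipeline

# Stand-in checkpoint for illustration; VEGI uses its own 21-emotion model.
classifier = pipeline("text-classification",
                      model="j-hartmann/emotion-english-distilroberta-base")

def get_sentiment(dialogue: str) -> str:
    """Return the dominant emotion label for a line of dialogue."""
    return classifier(dialogue)[0]["label"]  # e.g. "joy", "anger"

print(get_sentiment("We did it! The tower is down!"))
```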
TTS
After testing many different TTS models, from Bark to SpeechT5, VITS was chosen as the TTS model due to VEGI's size and latency constraints. The Linear layers of the model are converted to NPU-compatible QLinear layers at runtime. A TODO is to add Retrieval-Based Voice Conversion (RVC) to the pipeline, which would allow character-specific voices to be produced.
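A minimal sketch of the synthesis step, using the Hugging Face VITS implementation with a public checkpoint as a stand-in for VEGI's configuration, looks roughly like this (the runtime Linear-to-QLinear swap from the NPU section would happen right after loading):

```python
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

# Public VITS checkpoint used for illustration; VEGI's model/voice may differ.
model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

def speak(text: str, out_path: str = "line.wav") -> None:
    """Synthesize a line of dialogue to a WAV file."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (1, num_samples)
    scipy.io.wavfile.write(out_path,
                           rate=model.config.sampling_rate,
                           data=waveform.squeeze().numpy())

speak("A wild Pidgey appeared!")
```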
DEMO
Below are two demo videos of VEGI in action. VEGI is open source, and steps to run the software on CPU/NPU are available on GitHub.
TODOs
- Implementing Retrieval-Based Voice Conversion on the NPU to provide character voices.
- Sentiment-aware voices are an emerging research topic; once well established, they could be added to make the voices more emotive.
- Adding GPU support (currently only CPU and NPU are supported).
Competitive games such as Valorant require players to communicate to complete objectives, and teammates often post helpful callouts in game chat. These callouts are frequently overlooked or drowned out by other messages. AI can be used to select the important messages to be spoken aloud.