NVIDIA Generates Soundscapes in Abundance with Its Fugatto Foundational Gen-AI Model for Audio
"We envision Fugatto as a tool for creatives," NVIDIA's researchers say, "not a replacement for creativity."
A team of researchers at NVIDIA has released a foundational generative artificial intelligence (gen-AI) model for audio, capable of producing everything from sound effects to music and speech: the Foundational Generative Audio Transformer Opus 1, or Fugatto.
"We wanted to create a model that understands and generates sound like humans do," claims NVIDIA's Rafael Valle, applied audio researcher and orchestral conductor and composer, of the team's work. "Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale. The first time it generated music from a prompt, it blew our minds."
Built atop the researchers' existing experience with speech modeling, audio vocoding, and audio understanding, Fugatto is a 2.5-billion-parameter model trained on NVIDIA's high-end DGX systems using a dataset comprising millions of audio samples — ranging from real-world recordings to generated samples designed to expand the dataset's coverage. Like rival generative AI audio models, it turns text-based prompts — with or without example audio — into sound, but the researchers claim it eclipses those rivals through emergent capabilities and the ability to combine free-form instructions.
"One of the model’s capabilities we’re especially proud of is what we call the avocado chair," Valle explains, referring to image-based generative AI models' ability to create items which simply don't exist in the real world — like a chair that's also an avocado. In Fugatto's case, the "avocado chairs" are music-related: a trumpet that barks, for instance, or a saxophone that meows.
Another key feature of Fugatto is its use of a technique dubbed ComposableART, which allows it to combine different aspects of its training at inference time — delivering, NVIDIA explains by way of example, text spoken with a sad feeling in a French accent, despite that specific combination not appearing in its training data. "I wanted to let users combine attributes in a subjective or artistic way, selecting how much emphasis they put on each one," explains NVIDIA researcher Rohan Badlani. "In my tests, the results were often surprising and made me feel a little bit like an artist, even though I’m a computer scientist."
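NVIDIA hasn't spelled out ComposableART's exact formulation here, but the idea of weighting independent attributes at inference time resembles compositional classifier-free guidance. The sketch below is an illustrative assumption rather than the paper's method: each attribute's pull on the output is the gap between its conditioned prediction and an unconditioned one, scaled by a user-chosen weight.

```python
import numpy as np

def compose_guidance(uncond, conds, weights):
    """Blend per-attribute predictions at one inference step.

    Illustrative only: generic compositional classifier-free guidance,
    standing in for whatever ComposableART actually does. Each attribute
    contributes the difference between its conditioned prediction and
    the unconditioned one, scaled by a user-chosen weight.
    """
    out = uncond.copy()
    for cond, w in zip(conds, weights):
        out = out + w * (cond - uncond)
    return out

# Toy stand-ins for one inference step's model outputs.
rng = np.random.default_rng(0)
uncond = rng.normal(size=8)    # prediction with no instruction
sad = rng.normal(size=8)       # prediction for "spoken with a sad feeling"
french = rng.normal(size=8)    # prediction for "in a French accent"

# Weight the accent a little more heavily than the emotion.
blended = compose_guidance(uncond, [sad, french], weights=[0.8, 1.2])
print(blended.round(3))
```

Raising one weight relative to the others emphasizes that attribute, mirroring Badlani's description of letting users choose how much emphasis each attribute receives.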
Sounds generated by Fugatto can also change over time, in what Badlani calls "temporal interpolation" — and the model can generate soundscapes that were not part of its training data. According to NVIDIA's internal testing, it "performs competitively" against specialized models while offering greater flexibility.
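As a rough illustration of what a time-varying condition could look like (the model's internal mechanism isn't described in detail here, so the embeddings and schedule below are assumptions), one can linearly blend two conditioning vectors across the output's frames so the sound drifts from one description to the other:

```python
import numpy as np

def interpolate_conditions(cond_a, cond_b, num_frames):
    """Return one conditioning vector per output frame.

    A minimal sketch of the "temporal interpolation" idea: the condition
    fed to each frame is a linear blend that starts at cond_a and ends at
    cond_b, so the generated audio evolves smoothly over time.
    """
    alphas = np.linspace(0.0, 1.0, num_frames)  # mix ratio, 0 -> 1
    return np.stack([(1.0 - a) * cond_a + a * cond_b for a in alphas])

# Toy embeddings standing in for two text conditions.
rain = np.array([1.0, 0.0, 0.3])    # hypothetical "rainstorm" embedding
birds = np.array([0.0, 1.0, 0.7])   # hypothetical "birdsong" embedding

schedule = interpolate_conditions(rain, birds, num_frames=5)
print(schedule)  # row i conditions frame i of the generated audio
```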
More information is available on NVIDIA's research portal, along with a copy of the paper under open-access terms; example outputs are available on the project's demo site. "We envision Fugatto as a tool for creatives, empowering them to quickly bring their sonic fantasies and unheard sounds to life—an instrument for imagination," the researchers claim, "not a replacement for creativity."