A team of generative AI researchers at NVIDIA has introduced Fugatto, a model that takes text and audio as input and lets users generate or modify music, voices, and sounds with simple text prompts. The name is short for Foundational Generative Audio Transformer Opus 1.
While AI tools for music composition or voice modification are not new, Fugatto stands out for its versatility and precision. It can compose music snippets, alter the mood or accent of a voice, add or remove instruments from songs, and even generate sounds that have never been heard before.
A New Era in Audio Creativity
“This thing is wild,” said Ido Zmishlany, a multi-platinum producer, songwriter, and co-founder of One Take Audio, an NVIDIA Inception startup. “Sound inspires me to create music. With Fugatto, I can invent entirely new sounds on the fly in the studio. It’s incredible.”
According to Rafael Valle, NVIDIA’s manager of applied audio research and an orchestral composer, Fugatto reflects the team’s ambition to mimic human understanding and creation of sound. “We wanted a model that could perform like humans—versatile, intuitive, and powerful,” Valle explained.
Unprecedented Features
Fugatto shows emergent properties: it can combine capabilities it learned separately into a single, more complex output. For instance, it can synthesize audio from free-form instructions, such as a saxophone that meows or a trumpet that barks. The model also supports interpolation over time, which lets users control how a sound evolves, such as a thunderstorm that builds to a crescendo and then fades into birdsong at dawn.
These features are powered by ComposableART, a technique allowing the model to combine separate instructions into cohesive outputs. For example, Fugatto can generate a voice with a French accent and a sorrowful tone, while letting users fine-tune the degree of emotion or accent strength.
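To make the idea of combining instructions with adjustable strength concrete, here is a minimal sketch in Python. It is not Fugatto's code and is not taken from the ComposableART implementation: it assumes a guidance-style blend in which each instruction nudges a baseline prediction by a user-chosen weight. The function name `compose_instructions`, the instruction labels, and the toy numbers are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def compose_instructions(
    unconditional: Vector,
    conditional: Dict[str, Vector],
    weights: Dict[str, float],
) -> Vector:
    """Blend per-instruction predictions into one guided prediction.

    Assumed rule (an illustration, not the published method):
        result = unconditional + sum_i weight_i * (conditional_i - unconditional)
    """
    result = list(unconditional)
    for name, prediction in conditional.items():
        if len(prediction) != len(unconditional):
            raise ValueError(f"length mismatch for instruction {name!r}")
        w = weights.get(name, 0.0)  # an instruction with no weight has no effect
        for i, (c, u) in enumerate(zip(prediction, unconditional)):
            result[i] += w * (c - u)
    return result


# Toy vectors standing in for what a model might predict for each instruction.
unconditional = [0.0, 0.0, 0.0]
conditional = {
    "french accent": [1.0, 0.0, 2.0],
    "sorrowful tone": [0.0, 4.0, 2.0],
}

# Full-strength accent, half-strength sorrow.
blended = compose_instructions(
    unconditional,
    conditional,
    {"french accent": 1.0, "sorrowful tone": 0.5},
)
print(blended)
```

Raising or lowering a weight in this sketch plays the role of the "degree of emotion or accent strength" control described above; setting every weight to zero returns the baseline unchanged.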
“In designing this, I wanted users to explore attributes in a subjective, artistic way,” said Rohan Badlani, an AI researcher involved in the project. “The results often felt like artistry, even for someone like me, a computer scientist.”
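A sound that changes over time, like the thunderstorm that gives way to birdsong, can be pictured as a schedule of weights that shifts from one instruction to another. The sketch below is an assumption for illustration only: it uses a plain linear handoff, and the function name `crossfade_schedule` is hypothetical rather than part of Fugatto.

```python
from typing import List, Tuple


def crossfade_schedule(num_steps: int) -> List[Tuple[float, float]]:
    """Return (weight_a, weight_b) for each time step.

    The weights move linearly from all of sound A to all of sound B.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    schedule = []
    for step in range(num_steps):
        t = step / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
        schedule.append((1.0 - t, t))
    return schedule


schedule = crossfade_schedule(5)
for step, (storm, birds) in enumerate(schedule):
    print(f"step {step}: thunderstorm={storm:.2f} birdsong={birds:.2f}")
```

A real system could use any curve here; a linear ramp is simply the easiest one to read.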
Transforming Industries
Fugatto’s potential applications span industries:
- Music Production: Producers can rapidly prototype song ideas, experiment with different styles, and enhance audio quality with minimal effort.
- Advertising: Marketers can adapt campaigns with localized accents or emotional tones for different regions.
- Education: Language-learning tools can adopt familiar voices, such as those of family members, for personalized lessons.
- Gaming: Developers can modify audio assets to match in-game action or create new sound effects on demand.
Zmishlany believes Fugatto could redefine music creation. “The electric guitar gave us rock and roll. The sampler birthed hip-hop. AI is the next chapter. This is a new instrument—a game-changer,” he said.
The Science Behind Fugatto
Fugatto is a generative transformer model built on NVIDIA’s expertise in speech modeling, audio vocoding, and audio comprehension. Trained on NVIDIA DGX systems equipped with 32 NVIDIA H100 Tensor Core GPUs, the full version comprises 2.5 billion parameters.
Developing Fugatto required curating millions of diverse audio samples, blending datasets, and analyzing relationships within the data. The international team, with members from India, Brazil, China, Jordan, and South Korea, helped strengthen Fugatto’s multilingual and multi-accent capabilities.
Breakthrough Moments
The project took over a year to complete, with several memorable milestones. Valle recalls the first time the model successfully generated music from a text prompt. “It blew our minds,” he said.
Another highlight came during a demo where Fugatto created electronic music interspersed with dogs barking in rhythm. “When the team burst into laughter, I knew we had something special,” Valle said with a smile.
Writing the Next Chapter
Fugatto offers a glimpse of where audio creativity may be heading. By giving artists, creators, and industries a new way to work with sound, it could leave a lasting mark on music, media, and beyond. As Zmishlany put it, “We’re writing the next chapter of music history, and it’s exhilarating.”
//Staff writer