DeepMind is advancing its AI audio generation technology, unveiling its latest innovations aimed at democratizing music creation.
DeepMind’s new suite of AI music tools—MusicFX DJ, Music AI Sandbox, and enhancements to YouTube’s Dream Track experiment—aims to make music production more intuitive and accessible to everyone, from seasoned musicians to novices.
MusicFX DJ allows users to generate live music in real time, crafting melodies and rhythms without prior musical training. Users can mix musical concepts through simple text prompts, blend genres, adjust tempos, and introduce various instruments.
The technology adapts an offline generative music model for real-time streaming: the model generates the next segment of music based on both the previously generated music and the user’s input, in near real time. Instead of relying on a single text prompt, the model uses a combination of prompts, mixing their representations, known as embeddings, and weighting them according to the user’s preferences as adjusted via sliders. This makes for a dynamic and responsive music generation experience.
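To make the idea concrete, here is a minimal sketch of that prompt blending, assuming a generic text-embedding model and made-up function names; DeepMind has not published the MusicFX DJ API, so nothing below is its actual interface.

```python
import numpy as np

def blend_prompt_embeddings(prompt_embeddings, slider_weights):
    """Mix per-prompt embeddings into a single conditioning vector.

    prompt_embeddings: list of (dim,) arrays, one per text prompt.
    slider_weights:    non-negative slider values set by the user.
    """
    weights = np.asarray(slider_weights, dtype=np.float32)
    weights = weights / weights.sum()        # normalize so the mix sums to 1
    stacked = np.stack(prompt_embeddings)    # shape (num_prompts, dim)
    return (weights[:, None] * stacked).sum(axis=0)

# Example: lean the mix toward the first prompt while keeping a hint of the second.
embeddings = [np.random.randn(512) for _ in range(2)]   # stand-ins for real text embeddings
conditioning = blend_prompt_embeddings(embeddings, slider_weights=[0.8, 0.2])

# The streaming model would then condition the next chunk of audio on this
# blended vector plus the audio it has already produced (not shown here).
```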
Music AI Sandbox is DeepMind’s experimental suite designed to supercharge the workflows of musicians, producers, and songwriters. While still in the testing phase, this toolkit is proving invaluable. It includes features like loop generation, sound transformation, and in-painting, which seamlessly connects parts of a musical track. Trusted testers can sketch songs and use a multi-track view to organize and refine their compositions with precision.
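A rough sketch of what token-level audio in-painting looks like: mask the span between two sections and ask a model to fill it so they connect smoothly. The codec tokens and the `fill_masked` callable below are stand-ins; Music AI Sandbox’s actual interface has not been published.

```python
MASK = -1  # sentinel marking positions the model should generate

def inpaint_gap(track_tokens, gap_start, gap_end, fill_masked):
    """Mask a span of codec tokens and delegate filling it to `fill_masked`."""
    masked = list(track_tokens)
    masked[gap_start:gap_end] = [MASK] * (gap_end - gap_start)
    return fill_masked(masked)

# Toy stand-in for a generative model: bridges the gap by repeating the token
# right before it (a real model would predict musically coherent tokens).
def toy_fill(masked):
    out, last = [], 0
    for tok in masked:
        if tok == MASK:
            out.append(last)
        else:
            out.append(tok)
            last = tok
    return out

verse, chorus = [3, 3, 7, 7], [12, 12, 9, 9]
track = verse + [0, 0, 0, 0] + chorus           # placeholder tokens in the gap
print(inpaint_gap(track, 4, 8, toy_fill))        # -> [3, 3, 7, 7, 7, 7, 7, 7, 12, 12, 9, 9]
```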
DeepMind is also advancing speech generation with NotebookLM Audio Overviews. By harnessing the capabilities of Google’s experimental AI-first notebook, NotebookLM, users can transform uploaded documents into engaging audio dialogues. Two AI-generated hosts summarize the material, make connections between topics, and banter back and forth, making complex content more digestible.
Their speech generation models are built upon pioneering techniques like SoundStream and AudioLM. SoundStream is a neural audio codec that efficiently compresses and decompresses audio without compromising quality. It maps audio to acoustic tokens that capture essential information, including prosody and timbre. AudioLM treats audio generation as a language modeling task, producing sequences of these acoustic tokens. This approach allows the model to handle various types of audio without needing architectural adjustments.
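Here is a conceptual sketch of that language-modeling view of audio, with toy stand-ins for the codec and the sequence model; the real SoundStream and AudioLM systems are neural networks, and nothing below is their actual API.

```python
import numpy as np

class ToyCodec:
    """Stand-in for a neural audio codec: maps waveform frames to discrete
    acoustic tokens and back. A real codec learns this mapping; here we just
    quantize frame energy into a small vocabulary."""
    def __init__(self, frame_size=320, vocab_size=1024):
        self.frame_size, self.vocab_size = frame_size, vocab_size

    def encode(self, waveform):
        n = len(waveform) // self.frame_size
        frames = waveform[: n * self.frame_size].reshape(n, self.frame_size)
        energy = np.abs(frames).mean(axis=1)
        return np.minimum((energy * self.vocab_size).astype(int), self.vocab_size - 1)

    def decode(self, tokens):
        # A real codec reconstructs audio; this toy just emits constant frames.
        return np.repeat(tokens / self.vocab_size, self.frame_size)

def generate(codec, next_token_model, prompt_waveform, n_new_tokens):
    """Language-modeling view of audio generation: tokenize the prompt, extend
    the token sequence autoregressively, then decode back to a waveform."""
    tokens = list(codec.encode(prompt_waveform))
    for _ in range(n_new_tokens):
        tokens.append(next_token_model(tokens))   # predict the next acoustic token
    return codec.decode(np.array(tokens))

# Toy "model": repeats the most recent token (a real model is a Transformer).
audio_out = generate(ToyCodec(), lambda t: t[-1], np.random.rand(16000), 50)
```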
To generate long-form dialogues, DeepMind developed an even more efficient speech codec that compresses audio into sequences of tokens at rates as low as 600 bits per second. These tokens have a hierarchical structure grouped by time frames. Early tokens capture phonetic and prosodic information, while later tokens encode fine acoustic details. This hierarchical approach enables the models to generate up to two minutes of conversation with improved naturalness and speaker consistency.
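To give a sense of scale, here is a rough back-of-the-envelope calculation. The 600 bits-per-second figure comes from the text; the 10-bit token size is purely an illustrative assumption.

```python
BITRATE_BPS = 600            # stated compression rate
DIALOGUE_SECONDS = 120       # "up to two minutes of conversation"

total_bits = BITRATE_BPS * DIALOGUE_SECONDS
print(total_bits)            # 72000 bits to represent the whole dialogue

# If, hypothetically, each acoustic token carried 10 bits (a 1,024-entry
# codebook), a two-minute dialogue would be roughly this many tokens:
BITS_PER_TOKEN = 10
print(total_bits // BITS_PER_TOKEN)   # 7200 tokens, i.e. about 60 tokens per second
```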
The models were pretrained on hundreds of thousands of hours of speech data and fine-tuned on high-quality dialogue datasets with precise speaker annotations. They can generate audio more than 40 times faster than real time, producing two minutes of dialogue in under three seconds on a single TPU v5e chip.
Addressing ethical considerations, DeepMind is incorporating technologies like SynthID to watermark non-transient AI-generated audio. This helps safeguard against potential misuse, ensuring transparency and maintaining trust with users.
For future iterations, they’re exploring ways to improve model fluency and acoustic quality, and to add more fine-grained controls for features like prosody. There’s also interest in combining these advances with other modalities, such as video, to create even richer multimedia experiences.