How Meta’s AI Generates Music Based on a Reference Melody | by Max Hilsdorf | Jun, 2023

MusicGen, analyzed

Image by author.

On June 13th, 2023, Meta (formerly Facebook) made waves in the music and AI communities with the release of their generative music model, MusicGen. This model not only surpasses Google’s MusicLM, which was released earlier this year, in terms of capabilities, but is also trained on licensed music data and open-sourced for non-commercial use.

This means you can not only read the research paper or listen to demos, but also copy their code from GitHub or experiment with the model in a web app on Hugging Face.

In addition to generating audio from a text prompt, MusicGen can also generate music based on a given reference melody, a feature known as melody conditioning. In this blog post, I’ll demonstrate how Meta implemented this useful and fascinating functionality in their model. But before we delve into that, let’s first understand how melody conditioning works in practice.

Base Track

The following is a short electronic music snippet that I produced for this article. It features electronic drums, two dominant 808 basses, and two syncopated synths. When listening to it, try to identify the “main melody” of the track.

Using MusicGen, I can now generate music in other genres that follows the same main melody. All I need for that is my base track and a text prompt describing how the new piece should sound.

Orchestral Variant

A grand orchestral arrangement with thunderous percussion, epic brass fanfares, and soaring strings, creating a cinematic atmosphere fit for a heroic battle.

Reggae Variant

classic reggae track with an electronic guitar solo

Jazz Variant

smooth jazz, with a saxophone solo, piano chords, and snare full drums

How Good are the Results?

Although MusicGen doesn’t adhere closely to my text prompts and creates music that is slightly different from what I requested, the generated pieces still accurately reflect the requested genre and, more importantly, each piece showcases its own interpretation of the main melody from the base track.

While the results are not perfect, I find the capabilities of this model to be quite impressive. The fact that MusicGen has been one of the most popular models on Hugging Face ever since its release further underscores its significance. With that said, let’s delve deeper into the technical aspects of how melody conditioning works.

Three text-music pairs as they are used for training models like MusicLM or MusicGen. Image by author.

Virtually all current generative music models follow the same procedure during training. They are provided with a large database of music tracks accompanied by corresponding text descriptions. The model learns the relationship between words and sounds, as well as how to convert a given text prompt into a coherent and enjoyable piece of music. During the training process, the model optimizes its own compositions by comparing them to the real music tracks in the dataset. This allows the model to identify its strengths and the areas that require improvement.

The challenge lies in the fact that once a machine learning model is trained for a specific task, such as text-to-music generation, it is limited to that particular task. While it is possible to make MusicGen perform certain tasks that it was not explicitly trained for, like continuing a given piece of music, it cannot be expected to handle every music generation request. For instance, it cannot simply take a melody and transform it into a different genre. That would be like throwing potatoes into a toaster and expecting fries to come out. Instead, a separate model must be trained to implement this functionality.

Let’s explore how Meta adapted the model training procedure to enable MusicGen to generate variations of a given melody based on a text prompt. However, there are several challenges associated with this approach. One of the primary obstacles is the ambiguity in determining “the melody” of a song and representing it in a computationally meaningful way. Still, for the purpose of understanding the new training procedure at a broader level, let’s assume a consensus on what constitutes “the melody” and on how it can easily be extracted and fed into the model. In this scenario, the adjusted training method can be outlined as follows:

Three text-music-melody pairs as they were used for teaching MusicGen melody-conditioned generation.

For each track in the database, the first step is to extract its melody. Subsequently, the model is fed with both the track’s text description and its corresponding melody, prompting the model to recreate the original track. Essentially, this approach builds on the original training objective, where the model was solely tasked with recreating the track based on text.

To understand why we do this, let’s ask ourselves what the AI model learns in this training procedure. In essence, it learns how a melody can be turned into a full piece of music based on a text description. This means that, after training, we can provide the model with a melody and ask it to compose a piece of music in any genre, mood, or instrumentation. To the model, this is the same “semi-blind” generation task it has successfully accomplished countless times during training.
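To make the data flow concrete, here is a schematic toy sketch of how such a text-music-melody training example could be assembled. Everything here is illustrative: a track is faked as a list of MIDI note numbers, and `extract_melody` is a hypothetical stand-in for the real pipeline (source separation plus chromagram), not Meta’s actual code.

```python
def extract_melody(track_notes):
    """Toy melody extractor: collapse each note to its pitch class (C=0, ..., B=11).

    In the real pipeline, this role is played by a chromagram of the
    separated track, not by MIDI arithmetic.
    """
    return [note % 12 for note in track_notes]

def build_training_example(track_notes, text):
    """Pair the conditioning inputs (text + melody) with the reconstruction target."""
    return {
        "inputs": {"text": text, "melody": extract_melody(track_notes)},
        "target": track_notes,  # the model is trained to recreate the original track
    }

example = build_training_example(track_notes=[60, 67, 67, 62], text="classic reggae track")
print(example["inputs"]["melody"])  # [0, 7, 7, 2]
```

During training, the model sees `inputs` and is optimized to reproduce `target`; at inference time, the same interface lets us swap in any melody and any text prompt.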

Having grasped the method employed by Meta to teach the model melody-conditioned music generation, we still need to address the challenge of precisely defining what constitutes “the melody.”

The truth is, there is no objective way to determine or extract “the melody” of a polyphonic musical piece, except when all instruments are playing in unison. While there is often a prominent instrument such as a voice, guitar, or violin, that does not necessarily mean the other instruments are not part of “the melody.” Take Queen’s “Bohemian Rhapsody” as an example. When you think of the song, you might first recall Freddie Mercury’s main vocal melodies. However, does that mean the piano in the intro, the background singers in the middle section, and the electric guitar before “So you think you can stone me […]” are not part of the melody?

One method for extracting “the melody” of a song is to treat the most prominent melody as the most dominant one, typically identified as the loudest melody in the mix. The chromagram is a widely used representation that visually displays the most dominant musical notes throughout a track. Below, you can find the chromagram of the reference track, first with the full instrumentation and then excluding drums and bass. On the left side, the most relevant notes for the melody (B, F#, G) are highlighted in blue.
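To make the idea concrete, here is a minimal chromagram computed from scratch with NumPy. This is a simplified sketch (in practice one would use a library such as librosa, with more careful windowing and bin weighting), but it shows the core operation: folding STFT magnitude bins into the 12 pitch classes.

```python
import numpy as np

def chromagram(y, sr, n_fft=4096, hop=2048):
    """Minimal chromagram: map STFT magnitude bins to the 12 pitch classes (C=0)."""
    freqs = np.fft.rfftfreq(n_fft, d=1 / sr)
    valid = freqs > 30.0                       # ignore DC and sub-audio bins
    pcs = np.zeros(freqs.shape, dtype=int)
    # Pitch class of each FFT bin; A4 = 440 Hz is pitch class 9 (A).
    pcs[valid] = np.round(12 * np.log2(freqs[valid] / 440.0) + 9).astype(int) % 12
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(y) - n_fft + 1, hop):
        mag = np.abs(np.fft.rfft(y[start:start + n_fft] * window))
        chroma = np.zeros(12)
        np.add.at(chroma, pcs[valid], mag[valid])  # accumulate energy per pitch class
        frames.append(chroma)
    return np.array(frames).T  # shape: (12, n_frames)

# A pure B4 tone (~493.88 Hz) should light up pitch class 11 (B).
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 493.88 * t)
C = chromagram(tone, sr)
print(C.sum(axis=1).argmax())  # 11
```

Each column of the result is one time frame, and the loudest rows over time trace out the dominant notes, which is exactly what the chromagram figures in this article visualize.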

Both chromagrams accurately depict the primary melody notes, with the version of the track without drums and bass providing a clearer visualization of the melody. Meta’s study made the same observation, which led them to use their source separation tool (DEMUCS) to remove any distracting rhythmic elements from the track. This process results in a sufficiently representative rendition of “the melody,” which can then be fed to the model.
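Meta’s exact preprocessing is not reproduced here, but the idea can be sketched: separate the track into stems, drop the rhythmic ones, and remix the rest before computing the chromagram. The stem names below match DEMUCS’s four output stems; the arrays are random stand-ins for real separated audio.

```python
import numpy as np

# DEMUCS separates a track into four stems: drums, bass, other, vocals.
# Here we fake the stems with random audio; in practice they would come
# from running the separation model on the real track.
rng = np.random.default_rng(0)
stems = {name: rng.standard_normal(22050) for name in ("drums", "bass", "other", "vocals")}

# Drop the rhythmic stems and remix the rest before melody extraction.
RHYTHMIC = {"drums", "bass"}
melody_mix = sum(audio for name, audio in stems.items() if name not in RHYTHMIC)

# `melody_mix` is what would then be turned into a chromagram.
print(melody_mix.shape)  # (22050,)
```

The design point is that separation happens before the chromagram: once drums and bass are out of the mix, the loudest remaining notes are far more likely to belong to the melody.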

In summary, we can now connect the pieces to understand the underlying process when asking MusicGen to perform melody-conditioned generation. Here is a visual representation of the workflow:

How MusicGen produces a melody-conditioned music output. Image by author.
Photo by Xavier von Erlach on Unsplash

While MusicGen shows promising developments in melody conditioning, it is important to acknowledge that the technology is still a work in progress. Chromagrams, even when drums and bass are removed, offer an imperfect representation of a track’s melody. One limitation is that chromagrams categorize all notes into the 12 western pitch classes, meaning they capture the transition between two pitch classes but not the direction (up or down) of the melody.

For instance, the melodic interval of moving up from C4 to G4 (a perfect fifth) differs significantly from moving down from C4 to G3 (a perfect fourth). However, in a chromagram, both intervals would look the same. The problem worsens with octave jumps, because the chromagram would indicate that the melody stayed on the same note. Imagine how a chromagram would misinterpret the emotional octave leap performed by Céline Dion in “My Heart Will Go On” during the line “wher-e-ver you are” as a static melodic movement. To demonstrate this, just look at the chromagram for the chorus of A-ha’s “Take on Me” below. Does this reflect your idea of the song’s melody?
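The octave ambiguity is easy to see numerically. With MIDI note numbers (C4 = 60), a chromagram effectively keeps only the pitch class, i.e. the note number modulo 12:

```python
def pitch_class(midi_note):
    """Collapse a MIDI note number to one of the 12 pitch classes (C=0, ..., B=11)."""
    return midi_note % 12

C4, G3, G4, C5 = 60, 55, 67, 72

# A fifth up (C4 -> G4) and a fourth down (C4 -> G3) look identical...
assert pitch_class(G4) == pitch_class(G3) == 7
# ...and an octave leap (C4 -> C5) looks like no movement at all.
assert pitch_class(C5) == pitch_class(C4) == 0
```

Any melody extraction built purely on pitch classes therefore discards the contour information that makes leaps like these expressive.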

A chromagram of the chorus in “Take on Me” (A-ha), bass and drums removed. Image by author.

Another problem is the inherent bias of the chromagram. It performs well in capturing the melody of some songs while completely missing the mark on others. This bias is systematic rather than random. Songs with dominant melodies, minimal interval jumps, and unison playing are better represented by the chromagram than songs with complex melodies spread across multiple instruments and featuring large interval jumps.

Furthermore, the limitations of the generative AI model itself are worth noting. The output audio still exhibits noticeable differences from human-made music, and maintaining a consistent style over a six-second interval remains a struggle. MusicGen also falls short in faithfully capturing the more intricate aspects of the text prompt, as evidenced by the examples provided earlier. It will require further technological advances for melody-conditioned generation to reach a level where it can be used not only for amusement and inspiration but also for producing end-user-friendly music.

Photo by Marc Sendra Martorell on Unsplash

How can we improve the AI?

From my perspective, one of the primary problems that future research should address regarding melody-conditioned music generation is the extraction and representation of “the melody” from a track. While the chromagram is a well-established and straightforward signal processing method, there are numerous newer, experimental approaches that utilize deep learning for this purpose. It would be exciting to see companies like Meta drawing inspiration from these developments, many of which are covered in a comprehensive 72-page review by Reddy et al. (2022).

Regarding the quality of the model itself, both the audio quality and the comprehension of text inputs can be enhanced by scaling up the model and training data, as well as by developing more efficient algorithms for this specific task. In my view, the release of MusicLM in January 2023 resembles a “GPT-2 moment.” We are beginning to witness the capabilities of these models, but significant improvements are still needed across various aspects. If this analogy holds, we can expect the release of a music generation model comparable to GPT-3 sooner than we might anticipate.

How does this affect musicians?

As is often the case with generative music AI, concerns arise regarding the potential negative impact on the work and livelihoods of music creators. I anticipate that, in the future, it will become increasingly difficult to earn a living by creating variations of existing melodies. This is particularly evident in scenarios such as jingle production, where companies can effortlessly generate numerous variations of a characteristic jingle melody at minimal cost for new ad campaigns or personalized advertisements. Undoubtedly, this poses a threat to musicians who rely on such activities as a primary source of income. I reiterate my plea for creatives involved in producing music valued for its objective musical qualities rather than its subjective, human qualities (such as stock music or jingles) to explore alternative income sources to prepare for the future.

On the positive side, melody-conditioned music generation is an incredible tool for enhancing human creativity. If someone comes up with a catchy and memorable melody, they can quickly generate examples of how it might sound in various genres. This process can help identify the most suitable genre and style to bring the music to life. Moreover, it offers an opportunity to revisit old projects in one’s music catalogue, exploring their potential when translated into different genres or styles. Finally, this technology lowers the entry barrier for creatively inclined people without formal musical training. Anyone can now come up with a melody, hum it into a smartphone microphone, and share remarkable arrangements of their ideas with friends and family, or even attempt to reach a wider audience.

The question of whether AI music generation is beneficial to our societies remains open for debate. However, I firmly believe that melody-conditioned music generation is one of the use cases of this technology that genuinely enhances the work of both professional and aspiring creatives. It adds value by offering new avenues for exploration. I am eagerly looking forward to witnessing further developments in this field in the near future.
