Create stems from YouTube URL - how the f... does this work? (moises.ai)

binz

SS.org Regular
Joined
Jul 20, 2017
Messages
311
Reaction score
235
Location
Barcelona
My vocal teacher showed me this app, Moises. You can use it with limited functionality for free, and my mind was a bit blown by what it does.

https://moises.ai/

With it, you can separate or remove the vocals, drums, guitars and other background instrumentation. How the f... does this work? They say on their website that they are using AI, but that can mean anything or nothing. I work partly with AI myself, but I have no idea what the state of the art is for audio processing. Does anybody know how this works in more detail, or have a reference where it is explained?
 
Joined
Dec 2, 2014
Messages
6,824
Reaction score
4,714
Location
... perto de onde a terra acaba e o mar começa...
They must be doing deep analysis of the audio spectrum, and they must have a HUGE database. They are probably associating tone with pitch as a two-variable function to cut out an instrument: identify the tone associated with a certain pitch (and vice versa) and isolate it... it's what we do when we try to transcribe something, focusing our listening on a certain guitar, for example...? But I'm just guessing; fairy dust could probably work better...?
 

MoJoToJo

SS.org Regular
Joined
Jun 11, 2013
Messages
329
Reaction score
161
Location
The Black Stump.Australia
I gave it a go and tried to separate the drums from the guitar on a track I uploaded, but I still get drums on the guitar track and guitar on the drum track, so it's not perfect.
Maybe it works on specific tracks, but not the one I tried.
 

LostTheTone

Elegant Djentleman
Joined
Jan 12, 2021
Messages
1,528
Reaction score
1,395
Location
South east England
I gave it a go and tried to separate the drums from the guitar on a track I uploaded, but I still get drums on the guitar track and guitar on the drum track, so it's not perfect.
Maybe it works on specific tracks, but not the one I tried.

Yeah that's been my experience with all of these kinds of tools. Not to say that they are bad at all, and in fact they are minor miracles of technology, but equally they never seem to quite live up to what they advertise.

To be completely fair to them; if you imported all the stems and then mixed the track you would probably get pretty good results. It would sound like there was some bleed/crosstalk but some people love that and there are some dedicated tools just to create that effect. It's only when you try to completely remove one source from the mix that you start to see how much is left over in the other tracks.

It's the same thing you get when you're playing live with amps and cabs and the vocals running through the PA. You don't notice it at the time because everything is layering up, but if you record right off the PA you hear most of the rest of the band. The levels and EQ are all over the place, but there's way more bleed than you think there is.

The reality is that there is a huge technical challenge in trying to separate different sound sources from each other. Once the sounds are in the air they are all mixing and reverberating and doing weird stuff, and "separating" them is effectively impossible. You can fiddle with the EQ to make things more or less audible, but actually isolating them means re-creating information that isn't necessarily in the original.
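To put a toy example behind the EQ point (a NumPy/SciPy sketch, nothing to do with how Moises itself works): a filter can pull apart sources that sit in different frequency ranges, but once two sources share a frequency, the information an EQ would need to split them simply isn't there.

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 8000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 200 * t)     # stand-in for a bass note
high = np.sin(2 * np.pi * 2000 * t)   # stand-in for a cymbal wash
mix = low + high

# EQ-style "isolation": a low-pass filter recovers the bass well here
# only because the two sources occupy different frequency bands.
sos = butter(8, 800, btype="low", fs=fs, output="sos")
est_low = sosfilt(sos, mix)

# But put two different sources on the same 200 Hz pitch...
low2 = np.sin(2 * np.pi * 200 * t + 1.0)
mix2 = low + low2
# ...and the same filter passes both of them: no EQ curve can tell
# two sources apart when they overlap in frequency.
est_both = sosfilt(sos, mix2)
```

This is why source separation has to bring in information beyond frequency content, which is where the learned models come in.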
 
Joined
Dec 2, 2014
Messages
6,824
Reaction score
4,714
Location
... perto de onde a terra acaba e o mar começa...
I gave it a go and tried to separate the drums from the guitar on a track I uploaded, but I still get drums on the guitar track and guitar on the drum track, so it's not perfect.
Maybe it works on specific tracks, but not the one I tried.

The denser a mix is, the more difficult it is to separate the tones, I guess...?


The reality is that there is a huge technical challenge in trying to separate different sound sources from each other. Once the sounds are in the air they are all mixing and reverberating and doing weird stuff, and "separating" them is effectively impossible. You can fiddle with the EQ to make things more or less audible, but actually isolating them means re-creating information that isn't necessarily in the original.

... it's like removing the eggs out of a cake.
 

narad

Progressive metal and politics
Joined
Feb 15, 2009
Messages
12,226
Reaction score
19,190
Location
Tokyo
My vocal teacher showed me this app, Moises. You can use it with limited functionality for free, and my mind was a bit blown by what it does.

https://moises.ai/

With it, you can separate or remove the vocals, drums, guitars and other background instrumentation. How the f... does this work? They say on their website that they are using AI, but that can mean anything or nothing. I work partly with AI myself, but I have no idea what the state of the art is for audio processing. Does anybody know how this works in more detail, or have a reference where it is explained?

The most popular strategy (not saying it's what Moises uses, but it's closer to what researchers at the state of the art do) is something called deep clustering. The audio is broken into (sometimes overlapping) windows, and a deep neural network maps each window into a latent representation of how the audio streams are organized at roughly that time slice. Then a second network is used to predict a mask over the original audio, based on these latent representations. The networks are trained end-to-end, so it's just the mix as input and the individual streams as output (which are required for training), and the rest is all optimized towards that goal.
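To make the masking idea concrete, here's a minimal NumPy/SciPy sketch. It cheats by building "oracle" soft masks from the reference stems (in a real system a neural network predicts the masks from the mix alone), but the mask-and-resynthesize step is the same:

```python
import numpy as np
from scipy.signal import stft, istft

def mask_separate(mix, refs, fs=22050, nperseg=1024):
    """Split a mix into stems by soft-masking its spectrogram.

    `refs` are the ground-truth stems, used here only to build oracle
    masks for illustration; a trained network would predict the masks
    directly from `mix`.
    """
    _, _, X = stft(mix, fs=fs, nperseg=nperseg)
    mags = np.stack([np.abs(stft(r, fs=fs, nperseg=nperseg)[2]) for r in refs])
    # Each source claims its share of the energy in every time-frequency bin.
    masks = mags / (mags.sum(axis=0) + 1e-8)
    stems = []
    for m in masks:
        _, s = istft(m * X, fs=fs, nperseg=nperseg)
        stems.append(s[: len(mix)])
    return stems
```

A mask keeps the mix's phase and just re-weights magnitudes per bin, which is exactly why coinciding sounds cause trouble later in the thread: a bin can only be divided up, not re-imagined.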

I have a paper about it here ;-)
https://github.com/pfnet-research/meta-tasnet

Though it's a funny time to post about it - there was a big competition sponsored by Sony this year, with the results announced recently. A researcher at FB extended his work on wavenets/waveform-predicting models to basically re-generate the individual stems, as opposed to masking. Masking has some issues when sounds coincide with high-energy sounds, whereas the waveform-predicting method basically does some smart in-filling in those cases. Scores are way up.

https://www.aicrowd.com/challenges/music-demixing-challenge-ismir-2021
 

LostTheTone

Elegant Djentleman
Joined
Jan 12, 2021
Messages
1,528
Reaction score
1,395
Location
South east England
... it's like removing the eggs out of a cake.

Yes, exactly that! Or even worse; perhaps "putting the eggs back in the right shells after you baked the cake".

A researcher at FB extended his work on wavenets/waveform-predicting models to basically re-generate the individual stems, as opposed to masking. Masking has some issues when sounds coincide with high-energy sounds, whereas the waveform-predicting method basically does some smart in-filling in those cases. Scores are way up.

It strikes me (as a layman) that the masking approach would be easier to engineer but has an inherent upper limit in terms of what it can achieve. By contrast, the in-filling approach would be a greater technical challenge but could in theory deliver near-perfect results. It's a huge challenge to get a system that can detect which data needs to be in-filled and then do so in a way that sounds correct to humans, but once you've got to that point you can (in theory) recreate each track completely from scratch if you wanted to.

This is also a more interesting technology in general, with broader potential uses for musicians, and is a step towards the mythical "I play on guitar, tab appears on the screen" application that is always on the cusp of possibility.
 

narad

Progressive metal and politics
Joined
Feb 15, 2009
Messages
12,226
Reaction score
19,190
Location
Tokyo
Yes, exactly that! Or even worse; perhaps "putting the eggs back in the right shells after you baked the cake".



It strikes me (as a layman) that the masking approach would be easier to engineer but has an inherent upper limit in terms of what it can achieve. By contrast, the in-filling approach would be a greater technical challenge but could in theory deliver near-perfect results. It's a huge challenge to get a system that can detect which data needs to be in-filled and then do so in a way that sounds correct to humans, but once you've got to that point you can (in theory) recreate each track completely from scratch if you wanted to.

This is also a more interesting technology in general, with broader potential uses for musicians, and is a step towards the mythical "I play on guitar, tab appears on the screen" application that is always on the cusp of possibility.

Of course the in-filling approach has a greater upper limit, since it's a generative process, but in theory you could try to separate some Metallica and a Madonna vocal track could come out (i.e., the model is not constrained in its predictions, whereas the mask approach locks you into slices that must appear in the original audio). But ya, I think if people are trying these apps and not finding them working out due to drum/bass interference, try the FB system:

https://github.com/facebookresearch/demucs
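The "slices that must appear in the original audio" constraint has an extreme but instructive corner case, sketched below in plain NumPy: if two sources cancel each other in the mix, there is nothing left for a mask to scale back up, while a generative model is at least free to propose the missing audio.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
src_a = np.sin(2 * np.pi * 440 * t)   # source A
src_b = -src_a                        # source B, perfectly in antiphase
mix = src_a + src_b                   # destructive interference: silence

# A mask can only re-weight what is already in the mix, so even the
# most generous mask applied to silence returns silence.
mask = np.ones_like(mix)
estimate = mask * mix

print(np.max(np.abs(estimate)))  # 0.0, though the true source peaks near 1
```

Real mixes never cancel this perfectly, but partial cancellation and overlap in time-frequency bins cause the same problem in miniature, which is what the in-filling approach addresses.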
 

LostTheTone

Elegant Djentleman
Joined
Jan 12, 2021
Messages
1,528
Reaction score
1,395
Location
South east England
Of course the in-filling approach has a greater upper limit, since it's a generative process, but in theory you could try to separate some Metallica and a Madonna vocal track could come out (i.e., the model is not constrained in its predictions, whereas the mask approach locks you into slices that must appear in the original audio). But ya, I think if people are trying these apps and not finding them working out due to drum/bass interference, try the FB system:

https://github.com/facebookresearch/demucs

Right - Getting the generation part to work properly (or indeed at all) is a massive challenge. Especially with vocals it'll be really difficult to get outputs that make words, let alone the correct words. But once it is working it's a way more powerful tool overall.

In the end, the best way to handle this task is probably to use both approaches. For de-mixing you obviously want to keep as much of the original as possible, not just a new track of the same notes, so a pure generation approach wouldn't be ideal. So use the masking approach to clean up and give you better audio to start with, then fill in any bits that are missing or can't be cleanly extracted. That would keep as much of the original recording as possible, but where that's not possible you still get a full output for each track with much less overlap.
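A sketch of that hybrid control flow (hypothetical helper, NumPy only): run the masking stage, flag the time-frequency bins it wasn't confident about, and hand only those to an in-fill step. Plain linear interpolation stands in for the generative model here, purely to show the plumbing:

```python
import numpy as np

def hybrid_refine(masked_mag, confidence, threshold=0.5):
    """Refine a masked spectrogram estimate by in-filling low-confidence bins.

    masked_mag : magnitude spectrogram from the masking stage (freq, time)
    confidence : per-bin confidence of the mask, in [0, 1], same shape

    A real system would use a generative model for the in-fill; linear
    interpolation along time is a stand-in to illustrate the
    'mask first, generate where masking failed' structure.
    """
    out = masked_mag.copy()
    for f in range(out.shape[0]):
        row = out[f]
        bad = confidence[f] < threshold
        if bad.all() or not bad.any():
            continue  # nothing usable, or nothing to fix, in this band
        good_idx = np.flatnonzero(~bad)
        row[bad] = np.interp(np.flatnonzero(bad), good_idx, row[good_idx])
    return out
```

The appeal of this split is exactly what the post describes: trusted bins pass through untouched from the original recording, and the generative step only ever runs where the mask had nothing clean to offer.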
 

narad

Progressive metal and politics
Joined
Feb 15, 2009
Messages
12,226
Reaction score
19,190
Location
Tokyo
Right - Getting the generation part to work properly (or indeed at all) is a massive challenge. Especially with vocals it'll be really difficult to get outputs that make words, let alone the correct words. But once it is working it's a way more powerful tool overall.

It's actually a bit easier with speech, since you can multiply the probability of any in-filled audio by a language model's score for likely continuations of the previous speech.
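As a toy illustration of that rescoring idea (made-up numbers, not output from any real model): an acoustic model alone might slightly prefer a wrong transcription of the in-filled audio, but multiplying in a language model's score for each candidate continuation flips the ranking.

```python
# Candidate in-fills for a gap in speech, with toy probabilities.
# "acoustic": how well the generated audio matches the signal;
# "lm": how plausible the words are as a continuation of the speech.
candidates = {
    "hello world": {"acoustic": 0.40, "lm": 0.30},
    "hello whirled": {"acoustic": 0.45, "lm": 0.02},
}

# Combined score: acoustic probability times language-model probability.
scored = {text: p["acoustic"] * p["lm"] for text, p in candidates.items()}
best = max(scored, key=scored.get)

print(best)  # hello world: the LM vetoes the acoustically-preferred homophone
```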

In the end to best way to handle this task is probably doing using both approaches. For de-mixing you obviously want to keep as much of the original as possible; not just a new track of the same notes; so a pure generation approach wouldn't be ideal. So using the masking approach to clean up and give you better audio to start with, then you can fill in any bits that are missing or can't be cleanly extracted. That would keep as much of the original recording as possible, but where it's not possible you still get a full output for each track with much less overlap.

Yea, we actually did this previously with GANs to refine the meta-tasnet model and it worked well in some instances and not in others. It's something I always wanted to get back to but we stopped taking interns due to covid. I like the idea overall, and it seems easier than completely predicting the stems. But I should check out the new demucs paper since it's clear they're also now working with a separate spectrogram output and maybe they do combine these in a similar way.
 
Joined
Dec 2, 2014
Messages
6,824
Reaction score
4,714
Location
... perto de onde a terra acaba e o mar começa...
... removing the vocal stems from a mix is, in my opinion, the "easiest" challenge, since vocals are generally super high and clear in the mix and voice-recognition software is already super advanced... guttural vocals probably won't deliver good results...
 

MoJoToJo

SS.org Regular
Joined
Jun 11, 2013
Messages
329
Reaction score
161
Location
The Black Stump.Australia
Yeah that's been my experience with all of these kinds of tools. Not to say that they are bad at all, and in fact they are minor miracles of technology, but equally they never seem to quite live up to what they advertise.

To be completely fair to them; if you imported all the stems and then mixed the track you would probably get pretty good results. It would sound like there was some bleed/crosstalk but some people love that and there are some dedicated tools just to create that effect. It's only when you try to completely remove one source from the mix that you start to see how much is left over in the other tracks.

Yep, I tried a track with just one guitar and one drum track that I recorded and mixed down, then uploaded to Moises. It separated OK and the guitar track was fine, but the drum track was pretty bad, with bleed-through of guitar and a lot of sizzle/artifacts: nothing like the original, in fact. All good though, worth having a fiddle with it.
 
Joined
Dec 2, 2014
Messages
6,824
Reaction score
4,714
Location
... perto de onde a terra acaba e o mar começa...
Yep, I tried a track with just one guitar and one drum track that I recorded and mixed down, then uploaded to Moises. It separated OK and the guitar track was fine, but the drum track was pretty bad, with bleed-through of guitar and a lot of sizzle/artifacts: nothing like the original, in fact. All good though, worth having a fiddle with it.

A drum track is made up of lots of different instruments, each with a different tone. You should try to separate individual tracks for each drum instrument... or is that what you've done?
 

MoJoToJo

SS.org Regular
Joined
Jun 11, 2013
Messages
329
Reaction score
161
Location
The Black Stump.Australia
A drum track is made up of lots of different instruments, each with a different tone. You should try to separate individual tracks for each drum instrument... or is that what you've done?

No, I just recorded my short track, which was one drum track and one guitar track mixed, then uploaded it to Moises. It asked me how many tracks I wanted, and I clicked on separate guitar/drum tracks just to see if it would separate the guitar from the drums.
This is what it punched out >>> The first part is the original drum track, so you can see what I used in the mixdown. The second part is the separated drum track that Moises gave me from my mixed track. The third part is my original guitar track, so you can see what I mixed with the drums in the original track. I just wanted to see if it worked, but it didn't; I could have stuffed up. https://k00.fr/1j72m3ck
 