Facebook’s Researchers Describe a Technique to Isolate Speech in Recordings of Up to Five Individuals

Facebook’s researchers had a paper accepted at the 2020 International Conference on Machine Learning (ICML). In it, they describe a method for separating the voices of up to five people speaking simultaneously into a single microphone. Facebook’s team claims the method outperforms previous state-of-the-art approaches on speech-source separation benchmarks.

Isolating individual voices from a conversation is an important step toward better communication in apps such as video-calling and voice-messaging tools. Speech-separation methods like the one Facebook’s team proposes can also be used to suppress background noise.

The researchers built their model around a novel recurrent neural network. An encoder network first maps the raw audio waveform to a latent representation; a voice separation network then transforms that representation into an estimated audio signal for each speaker.
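
As a rough illustration of that encoder/separator split, the PyTorch sketch below maps a waveform to latent frames with a 1-D convolution, runs a recurrent separator over them, and decodes one waveform per speaker. The layer types, sizes, and names here are illustrative assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn


class VoiceSeparator(nn.Module):
    """Minimal encoder/separator sketch; all sizes are illustrative."""

    def __init__(self, n_speakers: int, latent_dim: int = 128):
        super().__init__()
        self.n_speakers = n_speakers
        # Encoder: map the raw waveform to a sequence of latent frames.
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=16, stride=8)
        # Separator: a recurrent network over the latent frames.
        self.separator = nn.LSTM(latent_dim, latent_dim, num_layers=2,
                                 batch_first=True, bidirectional=True)
        # One mask per speaker, applied to the shared latent frames.
        self.masks = nn.Linear(2 * latent_dim, n_speakers * latent_dim)
        # Decoder: map each speaker's masked frames back to a waveform.
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, samples) -> latent frames z: (batch, dim, frames)
        z = self.encoder(mixture.unsqueeze(1))
        h, _ = self.separator(z.transpose(1, 2))                    # (B, T, 2*dim)
        m = self.masks(h)                                           # (B, T, S*dim)
        B, T, _ = m.shape
        m = m.view(B, T, self.n_speakers, -1).permute(0, 2, 3, 1)   # (B, S, dim, T)
        masked = torch.sigmoid(m) * z.unsqueeze(1)                  # (B, S, dim, T)
        waves = self.decoder(masked.reshape(B * self.n_speakers, -1, T))
        return waves.view(B, self.n_speakers, -1)                   # (B, S, samples)
```

For a two-speaker mixture, `VoiceSeparator(2)(torch.randn(1, 16000))` returns a tensor of shape (1, 2, 16000): one estimated waveform per speaker.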

To work well, each separation model needs to know the total number of speakers in advance. However, a subsystem can detect the number of speakers automatically and select the appropriate model.
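
One simple way such a subsystem could decide whether an output channel carries speech is an energy threshold. The NumPy sketch below counts channels whose level clears a silence floor; the -40 dB threshold and the function name are assumptions made for illustration, not details from the paper.

```python
import numpy as np


def active_channels(outputs: np.ndarray, threshold_db: float = -40.0) -> int:
    """Count output channels whose energy clears a silence threshold.

    `outputs` has shape (n_channels, n_samples). The -40 dB floor is an
    illustrative assumption, not a value from the paper.
    """
    # RMS energy per channel, converted to dB relative to full scale.
    rms = np.sqrt(np.mean(outputs ** 2, axis=1))
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12))
    return int(np.sum(level_db > threshold_db))
```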

Facebook’s researchers trained separate models for isolating two, three, four, and five speakers. At inference time, the input mixture is first fed to the model built for five speakers, and the number of output channels containing active speech is counted. The process is then repeated with the model trained for that detected number of speakers, checking the output channels each time, and it stops once all of a model’s output channels are active or the model with the smallest number of target speakers has been reached.
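
Putting the pieces together, a hedged sketch of that selection loop might look like the following: start with the five-speaker model, count active output channels with the `active_channels` helper above, and retry with the matching smaller model until every channel is live or the two-speaker model is reached. The `models` dictionary and its interface are assumptions, not Facebook’s published code.

```python
def separate(mixture, models, smallest=2, largest=5):
    """Select a separation model by counting active output channels.

    `models` is assumed to map a speaker count (2..5) to a trained model
    that returns an array of shape (count, n_samples) for the mixture.
    """
    n = largest
    while True:
        outputs = models[n](mixture)
        k = active_channels(outputs)
        # Accept once every output channel carries speech, or once the
        # smallest available model has been reached.
        if k == n or n == smallest:
            return outputs
        # Otherwise retry with the model trained for the detected count.
        n = max(k, smallest)
```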

According to the researchers, the model could help improve audio quality for people who wear hearing aids, making it easier to follow speech in crowded or noisy settings such as restaurants and parties. The team now plans to prune and optimize the model until it performs well enough for real-world use.

Facebook’s paper follows a publication from Google proposing MixIT (mixture invariant training), an unsupervised approach to isolating and enhancing the voices of multiple speakers in audio recordings.
