Meta AI has just announced Voicebox, a new generative AI model for speech. The model can synthesize high-quality speech across multiple languages and can also perform tasks such as noise removal, content editing, style conversion, and sample generation.
Unlike previous speech synthesizers that required specific training for each task, Voicebox is designed to generalize across speech-generation tasks it was not specifically trained for, achieving a new level of quality for generative AI speech. It utilizes a method called Flow Matching, an advancement on non-autoregressive generative models, allowing it to learn from diverse and large-scale raw audio and transcription data.
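For readers curious what Flow Matching looks like in practice, here is a rough sketch of the general training objective as it is commonly formulated: the model learns to predict the velocity that carries random noise to a real data sample along a straight line. This is not Meta's code. The names `flow_matching_pair`, `flow_matching_loss`, `zero_model`, and `toy_features`, the `sigma_min` default, and the three-number "audio features" are all made up for illustration.

```python
import random

def flow_matching_pair(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, target_velocity) on a straight-line path from noise x0 to data x1."""
    # Point on the path at time t (t=0 is pure noise, t=1 is almost exactly the data).
    x_t = [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    # Velocity of that path, i.e. the derivative of x_t with respect to t.
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target

def flow_matching_loss(predict_velocity, x1, rng, sigma_min=1e-4):
    """Mean squared error between the predicted and the target velocity."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # random noise sample
    t = rng.random()                        # random time in [0, 1)
    x_t, target = flow_matching_pair(x0, x1, t, sigma_min)
    pred = predict_velocity(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x1)

rng = random.Random(0)
toy_features = [0.5, -1.0, 2.0]                      # hypothetical stand-in for audio features
zero_model = lambda x_t, t: [0.0] * len(x_t)         # hypothetical untrained model
loss = flow_matching_loss(zero_model, toy_features, rng)
print(f"loss for the untrained stand-in model: {loss:.3f}")
```

In a real system the stand-in model would be a neural network, and training would repeat this step over many samples to push the loss down.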
If it all sounds a bit confusing, that’s because it kind of is! The basic takeaway is that Meta has developed a new model that could be used to improve AI voices and AI text-to-speech applications. For example, given an audio sample just two seconds long, Voicebox can match its audio style and use it to generate speech from text, opening up possibilities for customizing the voices of virtual assistants and helping people who are unable to speak. Compared to the usual TTS voices found in other apps, this new technology could make synthetic voices sound more realistic than ever.
Voicebox’s ability to learn from varied speech data and generate speech that closely resembles real-world conversations makes it a valuable tool for training speech recognition models. Models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with only a 1% error rate degradation, compared with a 45 to 70% degradation for synthetic speech from previous text-to-speech models. This means that training new AI voices and text-to-speech tools could be done much more easily than previously thought.
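To put the degradation figures above in perspective, here is a small worked example. It assumes the percentages are relative increases in word error rate (WER), which the announcement does not spell out, and it uses a made-up baseline of 5% WER for a model trained on real speech. The function name `degraded_wer` is hypothetical as well.

```python
def degraded_wer(baseline_wer, relative_degradation_pct):
    """WER after a relative degradation, e.g. 45 means 45% worse than baseline."""
    return baseline_wer * (1.0 + relative_degradation_pct / 100.0)

baseline = 5.0  # hypothetical WER (in %) of a model trained on real speech

# Degradation percentages quoted in the article; the labels are ours.
cases = [("Voicebox", 1), ("earlier TTS, low end", 45), ("earlier TTS, high end", 70)]
for label, pct in cases:
    print(f"{label}: {degraded_wer(baseline, pct):.2f}% WER")
```

Under these assumptions, the Voicebox-trained model stays within a small fraction of a percentage point of the baseline, while models trained on older synthetic speech end up several points worse.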
Although Meta is not publicly releasing the Voicebox model or code due to concerns about potential misuse, it has shared audio samples and a research paper detailing its approach and results. Meta seems to be taking a similar approach to Google when it comes to new AI tech: both companies develop fantastic and innovative AI models, but they rarely release them to the public!
The paper also describes a highly effective classifier that can distinguish between authentic speech and audio generated with Voicebox, which is intended to mitigate potential risks. You can find Meta’s announcement, which links to the full paper, here: https://ai.facebook.com/blog/voicebox-generative-ai-model-speech/
By responsibly sharing its research, Meta aims to foster advancements in generative AI for speech and encourage further exploration in this field. We have already seen more and more AI text-to-speech tools released over the past few months, and this leap in technology is bound to push that number even higher!
Voicebox represents a significant step forward in the field of AI-generated speech, and Meta looks forward to seeing the impact it will have on generative AI for speech applications, similar to the impact earlier generative models have had on text, image, and video generation.