Voicebox is a new generative AI for speech that can generalize to speech-generation tasks it was not specifically trained to accomplish with state-of-the-art performance. It can create outputs in a vast variety of styles, from scratch or from a sample, and it can modify any part of a given sample. It can also perform tasks such as:

  • In-context text-to-speech synthesis: Using a short audio segment, it can match its style and generate text.
  • Cross-lingual style transfer: Given a sample of speech and a passage of text in six languages, it can produce a reading of the text in that language.
  • Speech denoising and editing: It can resynthesize or replace corrupted segments within audio recordings.
  • Diverse speech sampling: It can generate speech that is more representative of how people talk in the real world.

Voicebox uses a new approach called Flow Matching, which learns from raw audio and transcription without requiring specific training for each task. It also uses a highly effective classifier to distinguish between authentic speech and audio generated with Voicebox. Voicebox outperforms the current state-of-the-art English model VALL-E on zero-shot text-to-speech and cross-lingual style transfer and achieves new state-of-the-art results on word error rate and audio similarity. Voicebox is not publicly available because of the potential risks of misuse, but the researchers have shared audio samples and a research paper detailing the approach and results. They hope to see a similar impact for speech as for other generative AI domains in the future.