Today, Microsoft published a paper called “NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers”. In this paper, they present a new text-to-speech model that can reproduce not only human speech but also singing. The model combines a latent diffusion model with a neural audio codec to synthesize high-quality, expressive voices with strong zero-shot ability by generating quantized latent vectors conditioned on text input.
With this model, we are reaching a critical point. Text-to-speech is now good enough to fool people and to replace many jobs and positions that require speech. It also allows for better speech interfaces to language models, which makes the interaction more natural. As we approach a future where people have personal AI assistants, natural speech is a core technology. And even though NaturalSpeech 2 is not perfect, it is good enough to start this future.