Google DeepMind announced WaveNet last year, introducing new techniques for text-to-speech (TTS) software. Using 44 hours of audio, the team trained a neural network that generates raw, natural-sounding audio. The approach is a form of parametric TTS, in which the information needed to generate speech is stored in the parameters of the model. The team at Lyrebird, however, has developed a way to generate a voice “DNA” key from only one minute of audio. This key lets you generate audio in that voice in real time.
“Different voices share a lot of information,” Alexandre de Brébisson, Lyrebird co-founder and a PhD student at the Montreal Institute for Learning Algorithms laboratory at the University of Montreal, told Scientific American. “After having learned several speakers’ voices, learning a whole new speaker’s voice is much faster. That’s why we don’t need so much data to learn a completely new voice. More data will still definitely help, yet one minute is enough to capture a lot of the voice ‘DNA.’”
For its debut, Lyrebird released several demos of its software in action. This voice demo uses samples from Presidents Trump and Obama, as well as Hillary Clinton.
For comparison, here is a clip of WaveNet uploaded by It’s Only Science.
“Lyrebird is significantly faster than WaveNet at generation time,” said de Brébisson. “We can generate thousands of sentences in one second, which is crucial for real-time applications. Lyrebird also adds the possibility of copying a voice very fast and is language-agnostic.”
The possibility of abuse cannot be ignored, given how quickly and easily the software can copy someone’s voice. Lyrebird responds to this issue on its ethics page:
Voice recordings are currently considered as strong pieces of evidence in our societies and in particular in jurisdictions of many countries. Our technology questions the validity of such evidence as it allows to easily manipulate audio recordings. This could potentially have dangerous consequences such as misleading diplomats, fraud and more generally any other problem caused by stealing the identity of someone else.
By releasing our technology publicly and making it available to anyone, we want to ensure that there will be no such risks. We hope that everyone will soon be aware that such technology exists and that copying the voice of someone else is possible. More generally, we want to raise attention about the lack of evidence that audio recordings may represent in the near future.
As for monetization, de Brébisson told TechCrunch it won’t “necessarily” be free. “Maybe simple features will, or initial samples will be.” The company also plans to charge companies and developers based on the number of samples they request.