Microsoft’s New AI, VALL-E, Imitates a Voice in 3 Seconds

VALL-E is a new artificial intelligence that is making hair stand on end among people still amazed by how far technology has come and how close it is getting, invention by invention, to doing what a person can do.

The reason is that we have already seen AI mimic human activities such as holding deep conversations, doing housework, creating photos and texts, and even researching historical events. This is partly because more people are becoming aware of how artificial intelligence learns: through repetition, encoded information, and patterns of behavior that are rewarded or punished, all of which improve the technology’s abilities.

Now a project has shown that a person’s voice can be copied after just three seconds of listening to it. This is a new use for artificial intelligence, and a surprising one.

This project is called VALL-E. It is a language model for text-to-speech (TTS) synthesis made by Microsoft, a company that has put a lot of effort into improving this kind of technology in recent years. The idea is that once this artificial intelligence is good enough, it can be combined with the technology behind ChatGPT, which is known for generating text from basic information and making it seem like you are talking to someone else (even going so far as to write reviews of celebrities’ music albums). That is, over time, this voice simulator could also simulate a conversation, making users feel like they are talking to the person whose voice was recorded, even though both the words and the voice come from artificial intelligence.

One of the most surprising things about VALL-E is that it needs only three seconds of listening to the voice it wants to imitate, either in person or through a recording. Microsoft has also said that the artificial intelligence can mimic not only the voice but also the original rhythm of the speech and the tone in which the sample was recorded. This makes it feel even more like you are talking to someone you know.

What is VALL-E?

VALL-E is able to do all of this with so little information because it can be combined with other AI tools, such as TTS systems, speech editing, and GPT-3, which models the structure of human language. This helps it capture the logical order of speech and the patterns that convey emotions like anger or fatigue in the way something is said.

The model is not yet available for use, but published examples show how VALL-E can use just three seconds of speech to pick up on how a speaker is feeling and reflect that in its voice simulation.

According to the VALL-E research paper published on arXiv, the preprint server hosted by Cornell University, “In terms of speech naturalness and speaker resemblance, experiment results reveal that VALL-E beats the state-of-the-art zero-shot TTS system [AI that recreates voices it’s never heard]. Furthermore, we discovered that VALL-E could preserve the speaker’s emotion as well as the acoustic context of the acoustic cue during synthesis.”

https://twitter.com/steventey/status/1612586978201018368
